


The Mnemosyne architecture defines a compatible framework for a family of 
implementations with a range of capabilities. The following implementation- 
defined parameters are used in the rest of the document in boldface. The value 
indicated is for MicroUnity's first Mnemosyne implementation. 
Interfaces and Block Diagram



Mnemosyne uses two Hermes unidirectional, byte-wide, differential, packet- 
oriented data channels for its main, high-bandwidth interface between a memory 
control unit and Mnemosyne's memory. This interface is designed to be 
cascadeable, with the output of a Mnemosyne chip connected to the input of 
another, to expand the size of memory that can be reached via a single set of data 
channels. An external memory control unit is in complete control of the selection
and timing of operations within Mnemosyne and in complete control of the timing 
and content of information on the high-bandwidth interfaces. 

A Cerberus bit-serial interface provides access to configuration, diagnostic and 
tester information, using TTL signal levels at a moderate data rate. 

Mnemosyne contains additional interfaces to conventional dynamic random-access
memory devices (DRAM) using TTL signals. Each Mnemosyne device
contains output signals to independently control four banks of DRAM memory;
each bank is nominally 9 bytes wide, and connects to a single set of bidirectional
data interface pins. Each DRAM bank may use 24-bit addresses, to handle up to
16M-"word" DRAM memory capacity (such as 16M x 4 organized, 64-Mbit
DRAM). Up to four banks of DRAM may be connected to each Mnemosyne
device, permitting up to 0.5 Gbyte of DRAM per device.

Nearly all Mnemosyne circuits use a [...] voltage, nominally at 3.3
Volts (5% tolerance). A second [...] (5% tolerance) is used only for
TTL interface circuits. Power [...]. Packaging is TAB (Tape
Automated Bonding).

Pin assignments are to [...] 466 pins for 3.3V
power, 5.0V power and [...]

44 Internal circuit documentation names this signal VDDO.
The following is a diagram of the Mnemosyne device interfaces. (Numerical values
are shown for MicroUnity's first implementation.)




Absolute Maximum Ratings

                                        MIN    NOM    MAX    UNIT
[...]

Recommended operating conditions

                                        MIN    NOM    MAX    UNIT   REF
VT   Termination equivalent voltage     4.5    5.0    5.5    V
VDD  Main supply voltage                3.14   3.3    3.47   V      VSS
VCC  TTL supply voltage                 4.75   5.0    5.25   V      VSS
     Operating free-air temperature     0             70     C
COUT: Output or input-output capacitance: SD, A[...], RAS3..0,
CAS3..0, WE3..0, DQ71..0
Switching characteristics

                                                 MIN    NOM    MAX    UNIT
tBC:  HiC clock cycle time                       1000                 ps
tBCH: HiC clock high time                        400                  ps
tBCL: HiC clock low time                         400                  ps
tBT:  HiC clock transition time                                100    ps
tBS:  set-up time, Hi7..0 valid to HiC xition    200           100    ps
tBH:  hold time, HiC xition to Hi7..0 invalid    -200          -100   ps
tOS:  skew between HoC and Ho7..0                -50           50     ps
tC:   SC clock cycle time                        50                   ns
tCH:  SC clock high time                         20                   ns
tCL:  SC clock low time                          20                   ns
tT:   SC clock transition time                                        ns
tS:   set-up time, SD valid to SC rise                                ns
tH:   hold time, SC rise to SD invalid                                ns
tOD:  SC rise to SD valid                        [illegible]          ns




Logical and Physical [...]

[...] by an on-device [...] and [...] memory devices, and a [...]
on-device read-only and read/write [...] interface: the Hermes channel
[...] and the Cerberus serial interface used to [...]
are kept logically separate.

[...] words of size W bytes, [...] accesses all bytes of a single block.
[...] block.
Logical memory organization



Mnemosyne's DRAM memory physically consists of one or more banks of
multiplexed-address DRAM memory devices. A DRAM bank consists of a set of
DRAM devices which have the corresponding address and control signals
connected together, providing one word of W bytes of data plus ECC information
with each DRAM access.

Mnemosyne's SRAM memory is a write-back (write-in) single-set (direct-mapped) 
cache for data originally contained in the DRAM memory. All accesses to 
Mnemosyne memory space maintain consistency between the contents of the
cache and the contents of the DRAM memory. 

Mnemosyne's configuration region consists of read-only and read/write registers. 
The size of a logical block in the configuration memory space is eight bytes: one 
octlet. 

Communications Channels

High-bandwidth

Mnemosyne uses the Hermes high-bandwidth [...]
implementing a slave device.

Mnemosyne operates two Hermes high-bandwidth [...]: one
input channel and one output channel.

Mnemosyne uses the [...]
serves as the Hermes [...]
corresponds to the Hermes [...]

Configuration-region [...] to detect skew in
the byte-wide input [...] wide output channel.
This mechanism [...] to adjust for skew in the
channels, or [...] as may arise in
device-to-device [...]

Serial

A Cerberus serial bus [...] the Mnemosyne device, set
diagnostic [...] and to enable the use of the
part [...]



The DRAM interface uses TTL levels to communicate with standard, high-capacity
dynamic RAM devices. The data path of the interface is 8W+E bits. The
DRAM components used may have a maximum size of 2^(2P) words by K bits, where
the minimum value of K is determined by capacitance limits. (Larger values of K,
up to 8W+E, meaning fewer components are required to assemble a word of
DRAMs, are always acceptable.)
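
The relationship between the data path width, the component organization and
the bank capacity can be sketched as follows. This is only an illustration: the
function name and the example values W=8, E=8, P=12 and K=4 are assumptions,
chosen to agree with the 9-byte-wide bank, 24-bit address and 64-Mbit DRAM
mentioned in the interface overview; they are not read from the parameter table.

```python
import math


def dram_bank_geometry(w_bytes: int, e_bits: int, p_bits: int, k_bits: int) -> dict:
    """Derive DRAM bank figures from the architecture parameters W, E, P and K."""
    data_path_bits = 8 * w_bytes + e_bits            # data path width, 8W+E bits
    if not 1 <= k_bits <= data_path_bits:
        raise ValueError("K must lie between 1 and 8W+E")
    words = 2 ** (2 * p_bits)                        # maximum component depth, 2^(2P) words
    components = math.ceil(data_path_bits / k_bits)  # devices needed to assemble one word
    return {
        "data_path_bits": data_path_bits,
        "words_per_bank": words,
        "components_per_bank": components,
        "data_bytes_per_bank": words * w_bytes,      # ECC bits are not counted as data
    }


# Assumed example values (not taken from the parameter table): W=8, E=8, P=12, K=4.
geometry = dram_bank_geometry(8, 8, 12, 4)
print(geometry)
```

Under these assumed values, four banks hold 4 x 16M x 8 bytes of data, which
matches the 0.5 Gbyte figure quoted for a fully populated device.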
Error Handling



Mnemosyne performs error handling compliant with Hermes architecture. 
For the current implementation, the following table lists the errors that are
detected by design and the errors that are known not to be detected:



errors detected                        errors not detected
invalid check byte                     invalid identification number
invalid command                        internal buffer overflow
invalid address                        invalid check byte on idle packet
uncorrectable error in SRAM cache
uncorrectable error in DRAM memory


Detection of an uncorrectable error in either the SRAM cache or the DRAM
memory results in the generation of an error response packet and other actions
more fully described elsewhere.




Upon receipt of the error response packet, [...] must read the
status register of the reporting device [...] nature of the error.
Mnemosyne devices reporting an [...] suppress the receipt of
additional packets until the error [...] the status register.
However, such devices may continue [...] which have already been
received, and generate responses. [...] appropriate corrective actions and
clearing the error, the packet [...] reissue any unacknowledged
commands.

Because of the large [...] high-bandwidth Hermes
channel and the [...] safe to assume that,
after detecting [...] status register via
Cerberus [...] conditions and that the
queue of outstanding [...] the status register
via Cerberus, [...] sending requests to
the Mnemosyne [...]



[...] registers comply with the Cerberus and Hermes
specifications. Configuration registers are internal read/only and read/write
registers which provide an implementation-independent mechanism to query and
control the configuration of a Mnemosyne device. By the use of these registers, a
user of a Mnemosyne device may tailor the use of the facilities in a
general-purpose implementation for maximum performance and utility. Conversely, a
supplier of a Mnemosyne device may modify facilities in the device without
compromising compatibility with earlier implementations.

Read/only registers supply information about the Mnemosyne implementation in a 
standard, implementation-independent fashion. A Mnemosyne user may take 
advantage of this information, either to verify that a compatible implementation of 
Mnemosyne is installed, or to tailor the use of the part to conform to the 
characteristics of the implementation. The read/only registers occupy addresses 
0..5. An attempt to write these registers may cause a normal or an error response. 
Read/write registers select the mapping of addresses to SRAM and DRAM banks,
control the internal SRAM and DRAM timing generators, and select power and
voltage levels for gates and signals. The read/write registers occupy addresses
6..11, 16..19, and 32.

Reserved registers in the range 12..15, 20..31, and 33..63 must appear to be
read/only registers with a zero value. An attempt to write these registers may 
cause a normal or an error response. 
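
The register address ranges given in this section can be summarized in a small
lookup. This is a sketch for illustration only; the function name and the
returned strings are assumptions, and only the numeric ranges come from the
text (including the upper reserved range 64..2^16-1).

```python
def classify_config_register(octlet: int) -> str:
    """Classify a configuration-region octlet address by the ranges in the text."""
    if not 0 <= octlet < 2 ** 16:
        raise ValueError("octlet address must be in 0..2^16-1")
    if octlet <= 5:
        return "read/only"
    if 6 <= octlet <= 11 or 16 <= octlet <= 19 or octlet == 32:
        return "read/write"
    if octlet <= 63:
        # 12..15, 20..31 and 33..63: must appear as read/only zero
        return "reserved, reads as zero"
    # 64..2^16-1: may read as zero or may cause an error response
    return "reserved, implementation-dependent"


for address in (0, 6, 12, 19, 32, 33, 64):
    print(address, classify_config_register(address))
```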



Reserved registers in the range 64..2^16-1 may be implemented either as read/only
registers with a zero value, or as addresses which cause an error response if reads
or writes are attempted.

The format of the registers is described [...]. The octlet is the
Cerberus address of the register; bits indicate [...] field in a register.
The value indicated is the hard-wired value [...] read/only register,
and is the value to which the register [...] reset for a read/write
register. If a reset does not initialize [...] or if initialization is not
required by this specification, a [...] to the value field. The
range is the set of legal values [...] register may be set. The
interpretation is a brief description [...] of the register field; a
more comprehensive description [...]

octlet  bits
0       63..16



octjjtf 



bits 
63, 1. 




limplementor 
^ code 


3x00 

40 

a3 

24 

6d 

f3 




Identifies Mnemosyne Memory 
device as implemented by 
MicroUnity. 


implementor 
revision 


0x01 
00 




Implementation version 1.0. 
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octlet bits 
2 63=. 16 



manufacturer 
code 


0x00 

40. 

a3 

92 

b6 

79 




Identifies initial manufacturer "of 
Mnemosyne Memory device 
implemented by MicroUnity, 


manufacturer 
revision 


0x01 
00 




Manufacturing version 1.0. 



3 63..16 


serial 
number 


0 




This device haCndtslrial number 
capability.. % '% 




dynamic 
address . 


0 




|his devj^#^^no^ynamic 
sldressina capability? ' 


octlet  bits   field name   value   range   interpretation

[...]                               0..15   Number of redundant blocks per device. A zero value signifies 16 [...] blocks.
[...]                               [...]   Number of row and column address interface pins.
[...]                               0..15   Maximum value by which column address pin count may be less than row address pin count.
[...]                               [...]   log2 of number of banks of DRAM expansion.
[...]                               0..15   log2 of maximum interleaving level in DRAM interface.
[...]                               [...]   Reserved for definition in later revision of Mnemosyne architecture.

[...]   63..0                               Reserved for definition in later revision of Mnemosyne architecture.
octlet  bits   field name     value  range   interpretation
6       [...]  reset          [...]  [...]   Set to invoke device's circuit reset.
        [...]  clear          [...]  [...]   [...]
        [...]  selftest       [...]  0..1    Set to invoke device's selftest; bits 60..48 may indicate depth of selftest.
        [...]  tester         [...]  0..1    Set to invoke tester mode.
        [...]  isolate/synch  [...]  [...]   [...] tester mode: if set, suppress cache misses/writebacks. Tester mode: synch.
        [...]  ECC seed       0      0..255  Value to modify ECC code computed on incoming data. Used to exercise ECC detection/correction logic, or to write arbitrary patterns into memory.
        [...]  cidle 0        0      0..255  Value transmitted on idle Hermes output channel when output clock is zero (0).
        [...]  cidle 1        255    0..255  Value transmitted on idle Hermes output channel when output clock is one (1).
field name value range



reset/clear/ 
selftest 
complete 



reset/clear/ 
selftest 
status 



check byte 



locatic 
flag 
dirty flag 



ECC 
syndrome 
extension 



ECC 
syndrome 



0..1 



0..1 



This bit is set when a reset, clear or 
selftest operation has been 
completed 



This bit is set when a reset, clear or 
selftest operation has been 
completed successfully. 



This bit is set when a received input
packet has an incorrect check byte.



interpretation 




0.:25 Value of syndrome encountered on 
previous correctable or 
uncorrectable ECC error. 



if ECC. error was in cache memory, 
1 if ECC error was in DRAM memory. 



Dirty bit if error was in cache memory 



extend ECC syndrome value when e 
>8 



0..25 value sampled on Hermes input 
".hannel when input clock is zero (0). 



255 0..25 Value sampled on Hermes input 
£ channel immediately following 
sample value in raw 0 register. 
0


3 




Reserved for handling larger address 
spaces. 


ECC addr 


3 


0..2 3 
2-1 


Address at which an ECC error was 
detected. 



octlet bits 
9 63..60 

59..S6 
55.. 52 
51. .48 



field name value range 



log2id 



0 |0.. I Number of DRAM interleaving levels 
can be computed as Hl=JH ,0 9 2W . 



interpretation 



iter RAS 



fie relative to CAS 



t5 0*Y 0 ^ a i' e m d cycte time is t3+t4+t5, 
jpags mode oAS ftrecharge is t3+t5 




1 column 



Slative to RAS 



pset%> for refresh cycle. 
% ensufj*B-AS precharge is 



tccupied from end of 



put data on bus from start of 



I between two address bus 
insitions 



If set, generate refresh cycles. 



octlet 
11. .15 



0..12 Interval between refresh cycles. 



bits field name value range interpretation 

63 -° I 0 P P Reserved 
octlet  bits    field name         value  range   interpretation
16      63..56  analog skew bit 7  0xc2   0..255  Set power and voltage swing levels in Ho7 skew delay circuits.
        55..48  analog skew bit 6  0xc2   0..255  Set power and voltage swing levels in Ho6 skew delay circuits.
        47..40  analog skew bit 5  0xc2   0..255  Set power and voltage swing levels in Ho5 skew delay circuits.
        39..32  analog skew bit 4  0xc2   0..255  Set power and voltage swing levels in Ho4 skew delay circuits.
        31..24  analog skew bit 3  0xc2   0..255  Set power and voltage swing levels in Ho3 skew delay circuits.
        23..16  analog skew bit 2  0xc2   0..255  Set power and voltage swing levels in Ho2 skew delay circuits.
        15..8   analog skew bit 1  0xc2   0..255  Set power and voltage swing levels in Ho1 skew delay circuits.
        7..0    analog skew bit 0  0xc2   0..255  Set power and voltage swing levels in Ho0 skew delay circuits.



octlet bits 
18 63.. 56 



field name .„ 



SRAM pipe Oxc: 



*c2 0. 



age swing levels 
circuits. 



ge swing levels 
^'circuits. 



er and voltage swing levels 
'base detector circuits. 



ower and voltage swing levels 
acket forwarding logic circuits. 



Set power and voltage swing levels 
in packet forwarding PLA. 



n packet forwarding PLA. 

.25 Set power and voltage swing levels 
in tester logic circuits . 
octlet  bits    field name          value  range   interpretation
19      63..56  tester PLA          0xc2   0..255  Set power and voltage swing levels in tester PLA.
        55..48  dual port RAMs      0xc2   0..255  Set power and voltage swing levels in 2-port RAM circuits.
        47..40  big PLA             0xc2   0..255  Set power and voltage swing levels in big PLAs.
        39..32  small PLA           0xc2   0..255  Set power and voltage swing levels in small PLAs.
        31..24  pipeline interface  0xc2   0..255  Set power and voltage swing levels in pipeline interface circuits.
        23..16  other logic 2       0xc2   0..255  Set power and voltage swing levels in other logic 2 circuits.
        15..8   other logic 1       0xc2   0..255  Set power and voltage swing levels in other logic 1 circuits.
        7..0    other logic 0       0xc2   0..255  Set power and voltage swing levels in other logic 0 circuits.



octiet bits field name 
20..31 63..0 [~ 



octiet bits 
32 63..56 

55.. 48* 

. 47..40 

39.. 32? 



°4 y t: 

, - ■ 



redundan^O 



redui 



*25 



•nable; 
iork 



f address for redundant 
artition 0) 



dress for redundant 
|n0) 



Enlble and^address for redundant 
block'%„(partition 1) 



En< 



id address for redundant 
(partition 1) 



served for use with additional 
redundant blocks. 



bits 
63..0 



octiet bits 
64.. 63..0 
65536 



0 p 


P 


Reserved for use with additional 
redundant blocks. 


field name value range 


interpretation 



Reserved for use with later revisions 
' the architecture. 



configuration memory space 



3 
Identification Registers

The identification registers in octlets 0..3 comply with the requirements of the
Cerberus architecture.

MicroUnity's company identifier is: 0000 0000 0000 0010 1100 0101.

MicroUnity's architecture code for Mnemosyne is specified by the following table:



Internal code name      Code number
Mnemosyne               0x00 40 a3 49 d2 e4

Mnemosyne architecture revisions are specified by the following table:

Internal code name      Code number
1.0                     0x01 00



MicroUnity's Mnemosyne implementor codes are specified by the following table:

Internal code name      Code number
MicroUnity              [...]

MicroUnity's Mnemosyne, as implemented by MicroUnity, uses implementation
codes as specified by the following table:

Internal code name      Revision number
1.0                     0x01 00

MicroUnity's Mnemosyne, as implemented by MicroUnity, uses manufacturer
codes as specified by the following table:

Internal code name      Code number
[...]                   [...] a3 92 b6 79

MicroUnity's Mnemosyne, as implemented by MicroUnity, and manufactured by
[...], uses manufacturer revisions as specified by the following table:

Internal code name      Code number
1.0                     0x01 00
Architecture Description Registers

The architecture description registers in octlets 4 and 5 comply with the Cerberus
and Hermes specifications and contain a machine-readable version of the
architecture parameters: A, W, C, N, D, R, P, K, E, and I described in this
document.

Control Register

The control register is a 64-bit register with both read and write access. It is 
altered only by Cerberus accesses; Mnemosyne does not alter the values written 
to this register. 

The reset bit of the control register complies with the Cerberus specification and
provides the ability to reset an individual Mnemosyne device in a system. Setting
this bit is equivalent to a power-on reset or a broadcast Cerberus reset (low level
on SD for 33 cycles) and resets configuration registers to their power-on values,
which is an operating state that consumes minimal current. At the completion of
the reset operation, the reset/clear/selftest complete bit of the status register is
set, and the reset/clear/selftest status bit of the status register is set.



The clear bit of the control register complies with the Cerberus specification and
provides the ability to clear the logic [...] Mnemosyne device in a
system. Setting this bit causes all [...] logic to be reset, as is
required after reconfiguring power [...]. [...] completion of the
reset operation, the reset/clear/selftest [...] status register is set,
and the reset/clear/selftest [...]

The selftest bit of the [...] Cerberus specification
and provides the ability [...] Mnemosyne device in
a system. However, Mnemosyne does not define [...] selftest mechanism at this time,
so setting this bit [...] selftest complete bit and the
reset/clear/selftest [...]



The tester bit of [...] to use a Mnemosyne part
as a component [...] for a Mnemosyne or other part
using the [...]. [...] normal operation this bit must be
cleared. [...] is configured as either a signal
[...] of the source bit of the control
[...] parts are [...] to perform the signal source or
[...] the isolate/synch bit is set, a synchronization
[...] on the Hermes output channel and received on the Hermes
[...] channel to synchronize the cascade of four Mnemosynes; the isolate/synch
[...] must be turned off starting at the end of the cascade to properly terminate the
synchronization operation.

When not in tester mode, the isolate/synch bit of the control register is used to
initialize the SRAM cache and perform functional testing of the SRAM cache. This
bit must be cleared in normal operation. Setting this bit and setting the ECC
disable bit of the control register suppresses cache misses and dirty cache line
writebacks, so that the contents of the SRAM cache can be tested as if it were
simple SRAM memory. A read-allocate command returns the octlet data from the
SRAM cache entry that would be used to cache the requested location; the data is
unconditionally returned, regardless of the contents of the tag, dirty and ECC
fields of the SRAM cache entry. A read-noallocate command returns an octlet in
the following format:
[Octlet format diagram; legible bit positions: 63, 62, 61, 60, 48, 47; legible field name: tag]



A write-allocate command writes the octlet data, along with the dirty bit set, the
tag corresponding to the requested location, and valid ECC data into the SRAM
cache entry that would be used to cache the requested location. A write-noallocate
command writes the octlet data, along with the dirty bit cleared, the tag
corresponding to the requested location, and ECC data as if the dirty bit was set,
into the SRAM cache entry that would be used to cache the requested location.
The ECC seed field of the control register can be set to alter the ECC data that
would otherwise be written to the SRAM cache entry, to write completely
arbitrary patterns, or to write patterns in which the dirty bit is cleared and the
ECC data is valid.




The ECC disable bit of the control [...] to ignore ECC
errors in the SRAM cache and in [...] it may be set during
normal operation of Mnemosyne [...]

The module id field [...] module address for
Mnemosyne. The module address [...] module addresses
Mnemosyne will select [...]

Setting the PLL [...] internal clocking of
the high-bandwidth [...] directly. This bit is cleared
during normal [...]

The PLL range field of the control register is used to select an operating range for
the internal PLL. A three-bit field is reserved for this function, of which one bit is
currently defined; if the PLL range is set to zero, the PLL will operate at a low
frequency (below [...]); if the PLL range is set to one, the PLL will operate
at a high frequency (above [...]).

[...] fields of the control register set the slew rate for the TTL
DRAM control, address and data signals, as detailed in a [...]


Mnemosyne uses a sufficiently high-frequency clock that internal SRAM timing
can be controlled by synchronous logic, rather than asynchronous or self-timed
logic. Internal SRAM timing may be controlled by loading values into configuration
registers. The current specification reserves four bits for control of SRAM timing;
one is currently used.

The SRAM timing bit is normally cleared, providing internal SRAM cycle time of 4 
clock cycles. Setting the SRAM timing bit extends the cycle time to 5 clock cycles. 

The ECC seed field of the control register provides a mechanism to cause ECC
errors and thus test the ECC circuits. The field reserves 12 bits for this purpose;
8 bits are used in the current implementation. The field must be set to zero for
normal operation. The value of the field is xor'ed against the ECC value normally
computed for a write operation.
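
The effect of the seed can be sketched in a few lines. The function name and
the example check values are hypothetical; the only facts taken from the text
are the xor operation and the eight seed bits used by the current
implementation.

```python
ECC_MASK = 0xFF  # eight seed bits are used by the current implementation


def stored_ecc(computed_ecc: int, ecc_seed: int) -> int:
    """Return the ECC value written: the computed value xor'ed with the seed."""
    return (computed_ecc ^ ecc_seed) & ECC_MASK


# A zero seed (normal operation) leaves the computed ECC unchanged.
normal = stored_ecc(0x5A, 0x00)
# A single-bit seed flips one check bit, provoking an ECC error on read-back.
flipped = stored_ecc(0x5A, 0x01)
# Choosing seed = computed ^ wanted writes an arbitrary pattern (0x3C here).
arbitrary = stored_ecc(0x5A, 0x5A ^ 0x3C)
print(hex(normal), hex(flipped), hex(arbitrary))
```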

The cidle 0 and cidle 1 fields of the control register provide a mechanism to
repeatedly send simple patterns on the Hermes output channel for purposes of
testing and skew adjustment. For normal operation, the cidle 0 field must be set to
zero (0), and the cidle 1 field must be set to all ones (255).

Status Register 

The status register is a 64-bit register with both read and write [...], though the
only legal value which may be written is a zero, to clear [...] register. The result of
writing a non-zero value is not specified.

The reset/clear/selftest complete bit [...]
Cerberus specification and is set upon [...]
operation as described above.

The reset/clear/selftest status [...]
specification and is set upon [...]
operation as described above.

The check byte [...] received input packet
has an incorrect [...] or forwarded to the
Hermes output [...] generated.

The address [...] received input request
packet has an [...] as currently configured.
An error response [...]

The command [...] when a packet is received on
the Hermes [...], such as a read, write or
[...] error response packet [...]

The uncorrectable ECC error bit of the status register is set on the first
occurrence of an uncorrectable ECC error in either the SRAM cache or the
DRAM memory. The ECC location flag is set or cleared, indicating whether the
error was in the cache memory (cleared, 0) or the DRAM memory (set, 1). The 
ECC syndrome field of the status register is loaded with the syndrome of the data 
for which the error was detected. The ECC addr register is loaded with the 
address of the data at which the error was detected. An error response packet is 
generated. Once one uncorrectable ECC error is detected, no further correctable 
or uncorrectable ECC errors are reported in the status register until this error is 
cleared by writing a zero value into the status register. 

The correctable ECC error bit of the status register is set on the first occurrence 
of a correctable ECC error in either the SRAM cache or the DRAM memory, 
provided an uncorrectable ECC error has not already been reported. The ECC 
location flag is set or cleared, indicating whether- the error was in the cache 
memory (cleared, 0) or the DRAM memory (set, 1). The dirty flag indicates, for an 
error in the cache memory, the value of the dirty bit. The ECC syndrome field of
the status register is loaded with the syndrome of the data for which the error was 
detected. The ECC addr register is loaded with the address of the data at which 
the error was detected. Once one uncorrectable ECC error is detected, no further 
correctable ECC errors are reported in the status register until this error is 
cleared by writing a zero value into the status register. The occurrence of this 
error will cause a response packet to be generated with a "stomped" check byte 
pattern, but is not explicitly reported with an error response packet. 
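
The reporting priority described in the two paragraphs above can be modelled
roughly as follows. This is a sketch under stated assumptions, not the device
logic: the class and method names are invented for illustration, and two
behaviours are assumed rather than stated in the text, namely that only the
first correctable error is latched, and that a later uncorrectable error
overwrites the fields recorded for an earlier correctable one.

```python
from dataclasses import dataclass
from typing import Optional

CACHE, DRAM = 0, 1  # ECC location flag values given in the text


@dataclass
class EccStatus:
    uncorrectable: bool = False
    correctable: bool = False
    location: Optional[int] = None
    syndrome: Optional[int] = None
    addr: Optional[int] = None

    def report(self, uncorrectable: bool, location: int, syndrome: int, addr: int) -> bool:
        """Latch an ECC error; return True if it was recorded."""
        if self.uncorrectable:
            return False  # nothing further is reported until the register is cleared
        if not uncorrectable and self.correctable:
            return False  # assumption: only the first correctable error is latched
        if uncorrectable:
            self.uncorrectable = True
        else:
            self.correctable = True
        # Assumption: a later uncorrectable error replaces earlier correctable details.
        self.location, self.syndrome, self.addr = location, syndrome, addr
        return True

    def clear(self) -> None:
        """Model writing a zero value into the status register."""
        self.__init__()


status = EccStatus()
first = status.report(False, CACHE, 0x11, 0x100)   # recorded
second = status.report(False, DRAM, 0x22, 0x200)   # ignored
third = status.report(True, DRAM, 0x33, 0x300)     # recorded, overrides
fourth = status.report(True, CACHE, 0x44, 0x400)   # ignored until clear
```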



The other error bit of the status register is set when errors not otherwise specified
occur. There are no errors of this class reported by the current implementation.

The PMOS drive strength field of the status register [...] read/only field that
indicates the drive strength, or conductance [...] devices on the
Mnemosyne chip, expressed as a digital binary value [...] to calibrate
the power and voltage level configuration [...] in process
characteristics of individual devices. [...] is given by the
table:




The PLL in range bit of the status register indicates that the Hermes input 
channel clock and the PLL oscillator are running at sufficiently similar rates such 
that the PLL can lock. This bit is used to verify or calibrate the settings of the PLL 
range field of the control register. 

The ECC location flag bit of the status register, described above, indicates the
location of an uncorrectable ECC error or a correctable ECC error. If the bit is
set, the error was located in the DRAM memory; if the bit is clear, the error was
located in the SRAM cache memory.

The dirty flag bit of the status register, described above, exhibits the dirty bit read
from cache memory that results in an uncorrectable ECC error or correctable
ECC error. The value is undefined if the currently reported ECC error was read
from DRAM memory.

The ECC syndrome field of the status register, described above, exhibits the
syndrome of an uncorrectable ECC error or correctable ECC error. A 12-bit field
is reserved for this purpose; the current implementation uses eight bits of the
field. The values in this field are implementation-dependent.



ECC syndrome values representing single-bit errors for MicroUnity's first
implementation are detailed by the following table. Entries [...] are not covered by
the ECC code; syndrome values not shown in this table are uncorrectable errors
involving two or more bits.




[...] status register contain the values obtained from
[...] samples of the Hermes input channel. The raw 0 field contains a
[...] obtained when the input clock was zero (0), and the raw 1 field contains the
[...] obtained on the immediately following sample, when the input clock was one (1).
Mnemosyne must ensure that reading the status register produces two adjacent
samples, regardless of the timing of the status register read operation on
Cerberus. These fields are read for purposes of testing and control of skew in the
Hermes channel.

ECC Address Register

The ECC addr register indicates the address at which an uncorrectable ECC 
error or correctable ECC error has occurred. Bits 63..2P+E of the ECC addr 
register are reserved; they read as 0. If the ECC location flag bit of the status 
register is zero, the ECC addr register contains the cache address in bits C-1..0 
and the uncorrected cache tag in bits 2P+E-1..C. 
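
For the cache case (ECC location flag zero), splitting the register into its
two fields can be sketched as below. The function name and the example values
of C, P and E are assumptions for illustration; only the bit ranges C-1..0 and
2P+E-1..C come from the text.

```python
from typing import Tuple


def decode_ecc_addr(value: int, c: int, p: int, e: int) -> Tuple[int, int]:
    """Split an ECC addr value into (cache address, uncorrected cache tag)."""
    width = 2 * p + e
    value &= (1 << width) - 1                # bits 63..2P+E are reserved and read as 0
    cache_address = value & ((1 << c) - 1)   # bits C-1..0
    tag = value >> c                         # bits 2P+E-1..C
    return cache_address, tag


# Hypothetical parameter values, chosen only to exercise the bit arithmetic.
cache_address, tag = decode_ecc_addr((0x5 << 10) | 0x3FF, c=10, p=12, e=2)
print(hex(cache_address), hex(tag))
```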
DRAM Address Mapping

Mnemosyne may interleave up to 2^I DRAM accesses in order to provide for
continuous access of the DRAM memory system at the maximum bandwidth of
the DRAM data pins. At any point in time, while some memory devices are
engaged in row precharge, others may be driving or receiving data, and others
may be receiving row or column addresses. In order to maximize the utility of this
interleaving, the logical memory address bits which select the DRAM bank are the
least-significant bits.
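
The consequence of placing the bank-select bits at the least-significant end
can be illustrated with a short sketch. It does not reproduce the full field
layout of the address (row, column, select and int fields); the function name
and the choice of four banks are assumptions for illustration.

```python
from typing import Tuple


def split_bank(address: int, log2_banks: int) -> Tuple[int, int]:
    """Return (bank, address within bank), with the bank taken from the low bits."""
    bank = address & ((1 << log2_banks) - 1)
    return bank, address >> log2_banks


# Consecutive word addresses rotate through the banks, so a sequential
# access stream keeps several banks busy at once.
banks = [split_bank(address, 2)[0] for address in range(6)]
print(banks)
```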



A logical memory address determines which bank of [...] accessed, the row
and column of such an access, and which interleave [...]. The diagram
below shows the ordering of such fields in a general [...] configuration; the bit
addresses and field sizes shown are for a four-byte [...] memory address and a
two-way interleaved configuration of 1M-word DRAM [...]

[Address field diagram; legible bit positions: 22, 21, 20, 10; legible field name: int]

An access request which [...] in the select, row,
and int fields as a current [...] completion of the
active request, at which time [...] handled using a page
mode access. This mechanism [...] bandwidth access even
when the requests [...] provides for lower
latency [...] local to take
advantage of [...]

Mnemosyne devices [...] capacity, using the ma field in
the packet [...] the mapping between a
contiguous address space [...] address spaces made available
within each Mnemosyne [...] performance, the memory
controller should also interleave [...] address spaces so that references to
adjacent addresses are handled [...] devices.
DRAM Timing Control



An internal state machine uses configurable settings to generate event timing, to 
accommodate DRAM performance variations. The timing of DRAM read cycles to 
a single DRAM bank is shown below: 
The timing of a read cycle followed by a write cycle to a single DRAM bank is
shown below: 




The time intervals shown in the figures above control the following events:

interval  units  meaning
t1        4      Row address set up time relative to RAS.
t2        4      Row address hold time after RAS.
t3        4      Column address set up time relative to CAS.
t4        4      CAS pulse width. The data bus is sampled for a read cycle at the end of t4.
t5        4      Page mode cycle time is t3+t4+t5. Page mode CAS precharge is t3+t5.
t6        4      RAS precharge is t6+t1.
t7        4      CAS to RAS set up for refresh cycle. [...] t1 to ensure RAS precharge is met.
t8        4      Time data bus assumed to be occupied (by DRAM) after end of CAS low (end of t4) during read cycle. During t8, Mnemosyne will not drive CAS low for a read from another DRAM bank or start a write cycle to another DRAM bank.
t9        4      Time data bus driven (by Mnemosyne) from column address [...] during write cycle. During t9, Mnemosyne will not drive CAS low for a read from another DRAM bank, or start a write cycle to another [...]
t10       [...]  Interval between two address bus transitions. During t10, Mnemosyne will not change the address bus of another DRAM bank. This limits the noise generated by slewing [...]
t11       1024   Interval between refresh cycles.



quesisjiysefore the corresponding DRAM 
V&lue until they can be processed. 
Mm -lower priority than DRAM reads, 
is made to an address that is queued for a write operation.
writes are processed until the matching address is written,
Mnemosyne may make an implementation-dependent pessimistic guess that such
a conflict occurs, using a subset of the DRAM address to detect conflicts. The
number of DRAM writes which are queued is implementation-dependent. 
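The conflict rule described above (reads take priority over queued writes unless the read address matches a queued write, with the match made pessimistically on a subset of the address) can be sketched as follows. This is an illustration under stated assumptions, not the device's actual logic: the class name, the choice of the low-order bits as the compared subset, and the queue structure are all hypothetical.

```python
from collections import deque


class WriteQueue:
    """Pending DRAM writes with pessimistic read-conflict detection."""

    def __init__(self, compare_bits: int = 8):
        # Only the low `compare_bits` address bits are compared (an assumed
        # subset), so distinct addresses may be reported as conflicting.
        self.mask = (1 << compare_bits) - 1
        self.pending = deque()  # (address, data) pairs in arrival order

    def enqueue_write(self, address: int, data: int) -> None:
        self.pending.append((address, data))

    def read_conflicts(self, address: int) -> bool:
        key = address & self.mask
        return any((a & self.mask) == key for a, _ in self.pending)

    def drain_for_read(self, address: int) -> list:
        """Process queued writes in order until no queued write conflicts."""
        done = []
        while self.read_conflicts(address):
            done.append(self.pending.popleft())
        return done


queue = WriteQueue(compare_bits=4)
queue.enqueue_write(0x10, 1)
queue.enqueue_write(0x23, 2)
queue.enqueue_write(0x35, 3)
# A read of 0x43 shares its low four bits with the queued write to 0x23,
# so the first two writes are processed before the read may proceed.
processed = queue.drain_for_read(0x43)
```

Comparing fewer bits makes the comparator cheaper at the cost of more false conflicts, which only delay reads and never return stale data.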

Mnemosyne uses one address bus for each interleave because dynamic power and
noise are reduced by dividing the capacitance load of the DRAM address pins into
four parts and only driving one-fourth of the load at a time. A timer (t10) prevents
two address transitions from occurring too close together, to prevent power and 
noise on each address bus from having an additive effect. In addition, the loading 
of the already divided RAS, CAS, and WE signals is closer to the loading on the A 
signal when the address bus is also divided, reducing effects of capacitance 
loading on signal skew. 
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Power, Swing, Skew and Slew Calibration

Mnemosyne uses a set of configuration registers to control the power and voltage 
levels used for internal high-bandwidth logic and SRAM memory, to control skew 
in the output byte-channel, and to control slew rates in the TTL output circuits of 
the DRAM interface. The details of programming these registers are described 
below. 

Eight-bit fields separately control the power and voltage levels used in a portion of 
the Mnemosyne circuitry. Each such field contains configuration data in the
following format: 




value 


voltage swing level 


0 




1 




2 




3 




4 




5 




6 




7 





Voltage swing level control field interpretation 
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Values and interpretations of the res field are given by the following table:




io each ov field, xxx in 
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The digital skew fields set the number of delay stages inserted in the output path 
of the HoC and the Ho7..0 high-bandwidth output channel signals. Setting these 
fields, as well as the corresponding analog skew fields, permits a fine level of 
control over the relative skew between output channel signals. Nominal values for 
the output delay for various values of the digital skew and analog skew fields are 
given below: 




When Mnemosyne is reset, a value of 0 is loaded into the digital skew
fields, setting a minimum output delay for the HoC and Ho7..0 signals.

45 We need [...] to get these nominal values.



MU 0023438 



For evaluation only 



Highly Confidential 
- 226 - microunity confidential 



Terpsichore System Architecture 



REDACTED 



The output slope fields of the control register set the slew rate for the TTL 
outputs used for DRAM control, address and data signals, according to the 
following table: 




bdant physical memory blocks to 
fache memory. A systematic method for 
lescri 1 ! ' 
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To help clarify the following description, the figure below shows the logical 
arrangement of the physical memory blocks in the SRAM cache of MicroUnity's 
first Mnemosyne implementation. There are 40 physical memory blocks, each 
containing 2048 x 9 bits of data. The 40 blocks are divided into 4 banks of 10 blocks 
each. The 40 blocks are also divided into 2 partitions of 20 blocks each, and for 
each partition, there are two redundant memory blocks which can be configured 
to substitute for any of the 20 blocks in that partition. The 40 blocks are also 
divided up into 5 ranks, containing 8 blocks each, where each rank contains a 
distinct portion of a cache line. A cache line contains eight bytes of data, a 13-bit
tag, a dirty bit, four unused bits, and an 8-bit ECC field.
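The block counts and cache line fields quoted above can be cross-checked with a little arithmetic. The sketch below uses only the numbers given in the paragraph; the reading that one cache line occupies one 9-bit word from every block of a single bank, and the resulting capacity figures, are inferences rather than statements from the specification.

```python
BLOCKS = 40             # physical memory blocks
WORDS_PER_BLOCK = 2048
BITS_PER_WORD = 9
BANKS = 4
RANKS = 5
PARTITIONS = 2

# Cache line fields, in bits.
DATA_BITS = 8 * 8       # eight bytes of data
TAG_BITS = 13
DIRTY_BITS = 1
UNUSED_BITS = 4
ECC_BITS = 8

line_bits = DATA_BITS + TAG_BITS + DIRTY_BITS + UNUSED_BITS + ECC_BITS
blocks_per_bank = BLOCKS // BANKS
blocks_per_rank = BLOCKS // RANKS
blocks_per_partition = BLOCKS // PARTITIONS

# Inference: a line spans all blocks of one bank, one 9-bit word from each.
assert line_bits == blocks_per_bank * BITS_PER_WORD

cache_lines = BANKS * WORDS_PER_BLOCK      # derived, not quoted in the text
data_bytes = cache_lines * DATA_BITS // 8  # derived, not quoted in the text
print(line_bits, blocks_per_bank, blocks_per_rank, blocks_per_partition)
print(cache_lines, data_bytes)
```

The four unused bits are what make the 86 bits of useful content fill a whole number of 9-bit blocks.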

[Figure: physical memory blocks shown in rows for rank 4 through rank 0.
Legend: a box labeled n is a physical memory block in bank n; a box labeled x is
redundant memory block x.]
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arrangement of physical memory blocks 
for MicroUnity's first implementation 



Each redundant x field, where x is in the range 0..D*R-1, controls the enabling
and mapping address for a single redundant block. Starting at Cerberus address
32 and bits 63..56, each successive byte controls a redundant block, covering each
redundant block in partition 0, and then in successive bytes, blocks for additional
partitions. In other words, the redundant x field is located at Cerberus address
32+floor(x/8), bits 63-8*(x mod 8)..56-8*(x mod 8), and specifies the redundant mapping for
block (x mod R), of redundant partition floor(x/R). The format of each redundant x field
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is detailed in the following figure, with bit field sizes shown for MicroUnity's first 
implementation:

  7   6..5  4..2  1..0
| en |  0  | ra  | ba |



redundant block controls 
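The location and packing of a redundant x field can be sketched as below. This is a hedged illustration: it assumes each field occupies one byte (so bit offsets step by 8 per field), that R is the number of redundant blocks per partition, and that the first implementation's layout is en at bit 7, zero at bits 6..5, ra at bits 4..2 and ba at bits 1..0. The function names are hypothetical.

```python
def redundant_field_location(x: int, r: int) -> dict:
    """Locate the one-byte 'redundant x' field in Cerberus register space."""
    if x < 0 or r <= 0:
        raise ValueError("x must be non-negative and r positive")
    byte_in_octlet = x % 8
    high_bit = 63 - 8 * byte_in_octlet
    return {
        "cerberus_address": 32 + x // 8,
        "bits": (high_bit, high_bit - 7),
        "block": x % r,        # redundant block within its partition
        "partition": x // r,   # assumed reading of the garbled source formula
    }


def encode_redundant_field(enable: bool, rank: int, bank: int) -> int:
    """Pack en, ra and ba using the first-implementation field sizes."""
    if not 0 <= rank <= 4:
        raise ValueError("rank must be in 0..4")
    if not 0 <= bank <= 3:
        raise ValueError("bank must be in 0..3")
    return (int(enable) << 7) | (rank << 2) | bank


first = redundant_field_location(0, r=2)
last = redundant_field_location(3, r=2)
value = encode_redundant_field(True, rank=3, bank=2)
```

With two redundant blocks in each of two partitions, all four fields fit in the upper half of the octlet at Cerberus address 32.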



The range of valid values and the interpretation of the fields is given by the
following table:

field  bits       value  interpretation

en     1          0..1   If set, use this redundant block to
                         replace a physical memory block.
0      ...        0
ra     log2(...)         Replace physical memory block at
                         rank ra with the redundant block.
ba     log2(...)         Replace physical memory block at
                         bank ba with the redundant block.



Redundancy i 
bit if the control: 
with each red 
indicate the 1 
redundant blocl 
working redund 

In order to 
the intei 
Mnem^ 



feddr< 



^ie with the isolate/synch 
\zero, and then again 
,|tf the testing should 
Memory blocks and the 
ilcks is replaced with a 
Ids as required. 

^jflures to physical block failures, 
blocks must be elaborated. First, a 
,,~ar parts according to the following 
^Unity's first implementation: 

13 12 1 a 



T 



Mnemosyne cache address layout 




The interpretation of the fields is given by the following table: 






field  bits          interpretation

0      8A-2P-E       Must be zero
tag    t             These bits are stored into the cache
                     on a write operation and compared
                     against bits read from the cache on a
                     read operation.
ca     C-log2(...)   These bits are applied to the
                     physical memory block to select a
                     single SRAM cache word.
ba     log2(...)     These bits are used to select one of
                     the banks of physical memory blocks.



Mnemosyne cache ai 

For each cache address and cad 
cache tag, the cache data, and 
these fields is as shown in 
MicroUnity's first implemi 



•ormation, containing a 
Internal arrangement of 
big^field sizes shown for 




The interpretation of the fields is given by the following table:

field  bits          interpretation

                     ECC bits used to correct single-bit
                     errors and detect multiple-bit errors.

                     Data bits contain the visible cache
                     data, as it appears in the packets.

d      1             Dirty bit: indicates that the cache
                     line needs to be written to DRAM
                     memory on a miss.
u      S*n-e-8W-1-t  Unused bits pad cache line to even
                     number of physical memory blocks.
tag    t             Tag bits identify a Mnemosyne
                     logical address for this cache line.

Mnemosyne cache line field interpretation 
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From the tables above, for each failure identified in the cache SRAM, a physical
memory bank number, ba, can be identified from the Mnemosyne address, and a 
bit position, bi, can be identified from the Mnemosyne cache line layout. The bit 
position specifies a physical memory partition number, pa, according to the 
following formula: 

ibi mod s*D 
pa = j 



90 83 82 
1ECC 




To correct a failure in the cache SRAM, one of the working redundant blocks in
partition pa must be configured by setting a redundant x field, where x is in
range pa*D+D-1..pa*D, to the value:



. 7 f 
I 1 I 0 



r 



j 



redundant block controls 
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Multiple Memory Chips 

Up to four Mnemosyne memory devices may be cascaded to form effectively 
larger memories. The cascade of memory devices will have the same bandwidth as 
a single memory chip, but more latency. 




Packets are explicitly addressed to a particular Mnemosyne device; any packet
received on a device's input channel which specifies another module address is
automatically passed on via its output channel. This mechanism provides for the
serial interconnection of Mnemosyne devices into strings which function
identically to a single Mnemosyne, except that the string has larger
memory capacity and longer response latency.
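The forwarding behavior of a cascade can be modeled in a few lines. This is a simplified sketch: the `Device` class, the dictionary packet format and the hop count are illustrative assumptions, and response packets (which in a real string would continue downstream to the controller) are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    module_address: int
    handled: list = field(default_factory=list)

    def receive(self, packet: dict):
        """Return the packet to forward, or None if this device consumed it."""
        if packet["module"] == self.module_address:
            self.handled.append(packet)
            return None
        return packet  # passed on unchanged via the output channel


def send_through_cascade(devices, packet):
    """Walk a packet down the chain; return devices traversed, or None."""
    for hops, device in enumerate(devices, start=1):
        packet = device.receive(packet)
        if packet is None:
            return hops
    return None  # no device in the string claimed the packet


chain = [Device(0), Device(1), Device(2), Device(3)]
hops = send_through_cascade(chain, {"module": 2, "command": "read"})
unclaimed = send_through_cascade(chain, {"module": 7, "command": "read"})
```

The hop count grows with the position of the addressed device, which is the source of the longer response latency noted above, while every link still carries one packet stream at full channel bandwidth.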



and W parameters, in 




All devices in a cascade must haj 
order that each part may pro] 

Response Paa 

In general, a 
response packet 
packet and the ' 
forwarding of| 
the cache, by 
presence of qu< 
configurable 



command causes a 
the end of the request 
the processing and 
requested word in 
id I)R!^^%mg generators, by the 
req11^tfr*as well as other non- 
cle^iee parameters. 

pconfigurable parameters and 
iemory controller may completely 
, impendence on such characteristics is 
ixcept ro%|f|ting an% characterization purposes. 

Cache accesses, DRAM accesses, and forwarded packets typically have differing
latency before a response or forwarded packet is generated at the Hermes output 
channel, so that certain combinations would imply that two output packets would 
need to overlap. In such a case, Mnemosyne will buffer the later output packet 
until such time as it can be transmitted. However, the number of requests that 
can be buffered is strictly limited to eight (the number of identification numbers) 
per Mnemosyne device. It is the responsibility of the issuer of command packets to 
ensure the number of outstanding packets never exceeds the limits of the buffer. 
Mnemosyne may use non-fair scheduling for forwarded packets to avoid buffer 
overflow conditions. 
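Because the issuer of command packets is responsible for never exceeding the buffer limit, a memory controller needs some bookkeeping along the following lines. This is a minimal sketch; the class and method names are hypothetical, and only the limit of eight (the number of identification numbers) comes from the text.

```python
class RequestIssuer:
    """Tracks identification numbers so at most eight requests are outstanding."""

    MAX_OUTSTANDING = 8  # the number of identification numbers, per the text

    def __init__(self):
        self.free_ids = list(range(self.MAX_OUTSTANDING))
        self.outstanding = {}

    def can_issue(self) -> bool:
        return bool(self.free_ids)

    def issue(self, command: str) -> int:
        if not self.free_ids:
            raise RuntimeError("would exceed the device's buffer limit")
        ident = self.free_ids.pop(0)
        self.outstanding[ident] = command
        return ident

    def complete(self, ident: int) -> str:
        """Called when the response carrying this identification number arrives."""
        command = self.outstanding.pop(ident)
        self.free_ids.append(ident)
        return command


issuer = RequestIssuer()
ids = [issuer.issue(f"read {n}") for n in range(8)]
full = not issuer.can_issue()
issuer.complete(3)
reused = issuer.issue("read 8")
```

Tying the limit to the pool of identification numbers means a controller that never reuses an identifier before its response arrives cannot overflow the device buffer.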

The use of DRAM page mode accesses and interleaving requires knowledge of the 
relationship between a pair of transactions. Therefore, additional DRAM requests 
per interleave level may be transmitted before the time at which the DRAM 
controller may perform the request. These additional requests are queued and the 
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corresponding response packet is generated at a time controlled by the DRAM 
timing generator. DRAM interleaves are serviced in an 
i to enst 
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Calliope Interface Architecture 

Portions of this section have been temporarily removed to a separate document:
"Calliope Interface Architecture," though it is still a mandatory area of the 
Terpsichore System Architecture. 

MicroUnity's Calliope interface architecture is designed for ultra-high bandwidth
systems. The architecture integrates fast communication channels with SRAM 
buffer memory and interfaces to standard analog channels. 

The Calliope interfaces include byte-wide input and ( 
operate at rates of at least 1 GHz. These chai 
communication link to synchronous SRAM memory < 
interfaces to analog channels. Calliope provides a 
Terpsichore system architecture. How^#%Call; 
applications. 

Calliope's interface protocol 
space into packets containii 
The packets include check j 
multiple-bit errors with,. " 
device may be in pro^ " 
cascaded to expand^ku 

Architec, 



iriels intended to 
rovide a packet 
and a controller for 
for MicroUnity's 
many interface 



ions to a single memory 
id acknowledgement, 
nsmission errors and 
^operations in each 
Jhope devices may be 



The Calliope ate] 
channel architect! 
and complies^%itl 
parameters ^^nl&W, 



... [ermes high-bandwidth 
-_-.^s serial bus architecture, 
T Herpm and Cerberus. Calliope uses 
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The Calliope architecture defines a compatible framework for a family of
implementations with a range of capabilities. The following implementation-
defined parameters are used in the rest of the document in boldface. The value
indicated is for MicroUnity's first Calliope implementation. 




Calliope uses t 
data channels^d,,^ 
unit and CaHfey^ 
output o4j^2slUope^ip 
inierfae< resources that c 
extejntL^mc3^to!Uic.-i _. 

I Calliope 




#r ^y^-^iae^ifferential, packet-oriented 
lath int^ice between a memory control 
^Sed to be cascadeable, with the 
iput of another, to expand the 
^ . single set of data channels. An 
gte control of the selection and . timing of 
control of the timing and content of 



ie high-bandwidth interfaces. 

A Cerberus bit-serial interface provides access to configuration, diagnostic and
tester information, using TTL signal levels at a moderate data rate. 

Nearly all Calliope circuits use a single power supply voltage, nominally at 3.3
Volts (5% tolerance). A second voltage of 5.0 Volts (5% tolerance) is used only for
TTL interface circuits. Power dissipation is TBD. Initial packaging is TAB (Tape
Automated Bonding).

Pin assignments are to be defined: there are 174 signal pins and 466 pins for 3.3V
power, 5.0V power and substrate, for a total of 640 pins.
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count  pin             meaning

18     HiC, Hi7..0     hi-bandwidth input
18     HoC, Ho7..0     hi-bandwidth output

6      SC, SD, SN3..0  Cerberus interface
174                    total signal pins
?      VDD             3.3 V above VSS
?      VCC 46          5.0 V above VSS
?      VSS             most negative supply
640                    total pins



The following is a diagram of the Calliope device interfaces (numerical values are
shown for MicroUnity's first implementation):



4 




Calliope external block diagram 



46 Internal circuit documentation names this signal VDDO. 
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Absolute Maximum Ratings I MIN | NQM 1 MATIunITI 



operating conditions I MIN I NQM [ MA& | UNIT I REF 




equivalent voltage 
VDD 



— 



Ttr 



f voltage 



T3 



temperature 



T75 TM 



vss" 



-Aj ^ 
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Electrical characteristics 


MIN 


TYP 


MAX 


UNIT 


REF 


Voh ; H-state output voltage HoC, H07..0 








y 












v 


VDD 


Vih ; H-state input voltage HiC, Hi7 0 








V 


VDD 


Vil: L-state input voltage HiC, Hi7..o 








V 


VDD 


Ioh: H-state output current HoC, H07..0 








mA 




Iol : L-state output current HoC, H07..0 








mA 




Iih: H-state input current HiC, Hi7..o 








mA 




Iil: L-state input current HiC, Hi7..o 






#^ 


gffA 




Cin: Input capacitance HiC, Hi7..o 








pF 




Cout- Output capacitance HoC, H07 0 








dF 

HI — 




F^o H 0 l2^2 U 0 ^^^ lt Q ^Q7 1 1 , J|, 








V 


vss 


Vq^: L stato output voltago A^q^^ 








V 


VSS 


Vol: L-state output voltage SD . \ 






0.4 


V 


VSS 


V^f-H-state input veltago DQjq^ . 




- 




V 




V^: L ctato input voltage;©^M6^A" 








V 


VSS 


Vih: H-state input voltage 3D 




, \ 


%5 


V 


VSS 


Vih: H-state input vo1tar<« SC, SN3 0 . 






• 5.5 


V 


VSS 


V| L : L-state input voltage S%,SD',\SN 3 0 






0.8 


V 


VSS 


RASa-g CASa-cr-WR; o'^' 












l©t' L -tatc output aT^^YT^T?' 
RAS3 0 CASg,rjrW^3 <^rDQ^ -q~°' 


4 




+§ 


mA 










16 


mA 




loz: Off-s^pa!pu«iht ®T\v M 


-?Q 




10 


|iA 






%Q 




+0 


t±A 




hH-H-stetei put urrentSp.SN 3 o 


-10 




10 


fiA 




liL^Ustale - put current SWN3..0 ^ 


-10 




10 


fiA 




Cim' input capacitance SC, SN3..0 






4.0 


P F 




: %)UT: Output or input-output 
capacitance, SDr^n-^^, RASa^ 






4.0 


PF 
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Switching characteristics                      MIN    TYP    MAX    UNIT

tBC: HiC clock cycle time                      1544                 ps
tBCH: HiC clock high time                      600                  ps
tBCL: HiC clock low time                       600                  ps
tBT: HiC clock transition time                               100    ps
tBS: set-up time, Hi7..0 valid to HiC xition   200           100    ps
tBH: hold time, HiC xition to Hi7..0 invalid   -200          -100   ps
tOS: skew between HoC and Ho7..0               -50           50     ps
tC: SC clock cycle time                        50                   ns
tCH: SC clock high time                        20                   ns
tCL: SC clock low time                         20                   ns
tT: SC clock transition time                                 5      ns
tS: set-up time, SD valid to SC rise           - 4                  ns
tH: hold time, SC rise to SD invalid                                ns
tOD: SC rise to SD valid                       5                    ns



Logical and Phvsk 



Calliope defines 
static RAM menu 
configuration 
registers. Th< 
used to act 
access the coi 




id by an on-device 
rol . registers . and a 
id-only and read/write 
* v |he Hermes channel 
interface used to 
ly separate. 

farray o|2 8A "words of size W bytes. Each 
referencej%ll bytes of a single block. All 
block. 



Calliope's SRAM memory is a buffer for data which flows to or from interface 
devices. 

Calliope's configuration region consists of read-only and read/write registers. The 
size of a logical block in the configuration memory space is eight bytes: one octlet. 
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Communications Channels 

High-bandwidth 

Calliope uses the Hermes high-bandwidth channel and protocols, implementing a 
slave device. 

Calliope operates two Hermes high-bandwidth communications channels, one 
input channel and one output channel. 

Calliope uses the Hermes packet structure. There is no j 
to the Hermes-designated cache, so the no-allocate att 
operations has no effect.. 

Configuration-region registers provide^ 
the byte- wide input channel, and to $ 
This mechanism may be employed \ i_ 
channels, or set to fixed pattei 
device-to-device wiring. 




corresponding 
read and write 



detect skew in 
output channel, 
for skew in the 
as may arise in 



Calliope device, set 
to^nable the use of the 



£oAl%itAt|P jHermes architecture. 

wing errors are designed to be detected 



errors detected        errors not detected

invalid check byte     invalid identification number
invalid command        internal buffer overflow
invalid address        invalid check byte on idle packet
                       uncorrectable error in SRAM buffer





Upon receipt of the error response packet, the packet originator must read the 
status register of the reporting device to determine the precise nature of the error. 
Calliope devices reporting an invalid packet will suppress the receipt of additional
packets until the error is cleared, by clearing the status register. However, such 
devices may continue to process packets which have already been received, and 
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generate responses. Upon taking appropriate corrective actions and clearing the
error, the packet originator should then re-send any unacknowledged commands.

Because of the large difference in clock rate between the high-bandwidth Hermes 
channel and the Cerberus serial bus interface, it is generally safe to assume that, 
after detecting an error response packet, an attempt to read the status register via 
Cerberus will result in reading stable, quiescent error conditions and that the 
queue of outstanding requests will have drained. After clearing the status register 
via Cerberus, the packet originator may immediately resume sending requests to 
the Calliope device.
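The recovery sequence described above (read the status register over Cerberus, clear it, then re-send unacknowledged commands over Hermes) might be organized as in the sketch below. The `read_status`, `clear_status` and `send` methods are hypothetical stand-ins for whatever bus access routines a controller provides; the fake classes exist only so the sketch runs on its own.

```python
def recover_from_error(cerberus, hermes, unacknowledged):
    """Read status, clear it, then re-send unacknowledged commands."""
    status = cerberus.read_status()  # determine the precise nature of the error
    cerberus.clear_status()          # device resumes accepting packets
    for command in unacknowledged:
        hermes.send(command)         # re-send in the original order
    return status


class FakeCerberus:
    """Stand-in for a Cerberus serial bus accessor (illustrative only)."""

    def __init__(self, status):
        self.status = status

    def read_status(self):
        return self.status

    def clear_status(self):
        self.status = 0


class FakeHermes:
    """Stand-in for a Hermes channel transmitter (illustrative only)."""

    def __init__(self):
        self.sent = []

    def send(self, command):
        self.sent.append(command)


cerberus = FakeCerberus(status=0x4)
hermes = FakeHermes()
reported = recover_from_error(cerberus, hermes, ["write 1", "read 2"])
```

No explicit wait for the request queue to drain appears in the sketch, relying on the clock-rate argument made in the preceding paragraph.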



Cerberus Registers 

Calliope's configuration registers comply with, 
specifications. Cerberus registers are ^fd^a! 1 
which provide an implementation-indlgen^ln 
the configuration of devices in a Td* 
a user of a Terpsichore system <i 
purpose implementation fbr^ 
supplier of a Terpsichore^ 
without compromising cp^„^. „ . 
are accessed via the C^^Tsd 




and Hermes 
write registers 
luery and control 
of these registers, 
facilities in a general- 
itility. Conversely, a 
dirties in the device 
tI|ions. These registers 



As a device comj 
a set of Cerl 
configuratioi 
including Eut< 



liope interface contains 
additional sets of 
'erpsichore system, 
iory devices. 

iion at^put "i;he Terpsichore system 
ttatio|^Jn|fependent fashion. Terpsichore 
leather to verify that a compatible 
>r the use of the part to conform to 

Read/only registers occupy addresses 0..5. An attempt to write these registers
may cause a normal or an error response.

Read/write registers select operating modes and select power and voltage levels
for signals. The read/write registers occupy addresses 6..7, 10..14 and 25..32.



Reserved registers in the range 8..9, 15..24 and 33..63 must appear to be read/only
registers with a zero value. An attempt to write these registers may cause a normal
or an error response.

Reserved registers in the range 64..2^16-1 may be implemented either as read/only
registers with a zero value, or as addresses which cause an error response if reads 
or writes are attempted. 
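The register address map above can be summarized as a small classifier. This is a sketch under one stated assumption: the read/write ranges are taken as 6..7, 10..14 and 25..32, the last of which is partly illegible in the source and is inferred from the reserved ranges on either side. The function name and the returned strings are illustrative.

```python
def classify_register(address: int) -> str:
    """Classify a Calliope Cerberus register address by access type."""
    if not 0 <= address <= 2**16 - 1:
        raise ValueError("Cerberus register address out of range")
    if address <= 5:
        return "read-only"
    if address in (6, 7) or 10 <= address <= 14 or 25 <= address <= 32:
        return "read/write"
    if address <= 63:
        # 8..9, 15..24 and 33..63 must read as zero.
        return "reserved, reads as zero"
    # 64..2^16-1 may read as zero or cause an error response.
    return "reserved, zero or error response"


kinds = [classify_register(a) for a in (0, 6, 8, 12, 20, 25, 40, 64)]
```

Software that probes a device can use such a map to avoid touching addresses whose behavior the architecture leaves open.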
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The format of the registers is described in the table below. The octlet is the 
Cerberus address of the register; bits indicate the position of the field in a register. 
The value indicated is the hard-wired value in the register for a read/only register, 
and is the value to which the register is initialized upon a reset for a read/write 
register. If a reset does not initialize the field to a value, or if initialization is not 
required by this specification, a * is placed in or appended to the value field. The 
range is the set of legal values to which a read/write register may be set. The 
interpretation is a brief description of the meaning or utility of the register field; a 
more comprehensive description follows this table.

octlet bits 
0 63.. 16 



octlet bits 
1 63.. 16 




implermpri^ 


3x00 
40 

a 3 

m 

t. 




IggJJfWeS C|§tj^p%interface device 
as implemented by Microllnity. 


implementor 

revi#ctfl*\ 


3x01 

JO , 




if^emq^gW version 1.0. 



2 jg* 




a4 
6d 
ff 




Identifies initial manufacturer of 
tlftffiope interface device 
implemented by MicroUnity as 
MicroUnity. 


15..0 


manufacturer 
revision 


0x01 
30 




Manufacturing version 1.0. 


octlet bits    field name       value  range  interpretation

3      63..16  serial number    0             This device has no serial number
                                              capability.
       15..0   dynamic address  0             This device has no dynamic
                                              addressing capability.
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octlet bits    field name  value  range   interpretation

4      63..60  A           4      0..15   size of a Hermes address
       59..56  log2 W      3      0..15   size of a Hermes word
       55..48  C           11     0..255  log2 of buffer capacity in words
       47..0   0           0      0       Reserved for definition in later
                                          revision of Calliope architecture
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octlet 
6 



field name value range 



reset 



clear 
selftest 



defer writes 0* 0 .1 



0..1 



set to invoke device's circuit reset 



set to invoke device's logic clear 



set to invoke device's selftest: bits 
SO. .48 may indicate depth of selftest 



set to cause writes to octlets 25.-43 
io be deferred until the next logic- 
clear or non-deferred write. 



interpretation 
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field name 'vfalueranae 



reset/clear/ 
selftest 
complete 



reset/clear/ 
selftest 
status 



meltdown 
detected 



low voltage 



0..1 



This bit is set when a reset, clear or 
selftest operation has been 
completed. 



This bit is set when a reset, clear or 
selftest operation has been 
completed successfully. 



This bit is set when the meltdown
detector has caused a reset.



This bit is set when the voltage or
temperature is too low for proper
operation of logic circuits.



interpretation 




'sampled on specified Hermes 
nel immediately following 
sample value in raw O register. 



- field name value range 
63..0 1 0 jo (0 {Reserved 



interpretation 



f ' 
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octlet 
10 



63.-56 
55. 48 
47. .40 
39.32 
31.. 24 
23-16 
15..12 
11. .8 
7..4 

3 

2 

1 



: bits 
63.. 56 
55 
54 
53 
52 
51 
50 
49.. 48 
47. Ai 



>27 
26 
25 



23.. 16 
15. .8 



field name 


value 


ranqe 


interpretation 


0 


3 


0 


Reserved 


PLL anob 


224 


0..23O 


PLL anaiog-knob settings 


0 


D 


0 


Reserved 


CI2 test 


3 


0..7 


CI 2 test control 


CM test 


3 


0..7 


CM test control 


CI2adc anob 


224 


0..230 


CI 2 ADC analog-knob settings 


CI2Q filter 


3 


0..7 


CI2 Q filter adjust 


CI2I filter 


3 


0..7 


CI 2 I filter adjust -.. * 


Q 


r) 


Q 


rieservea ^ 


CI2 VCO 


0 


0..1 


CI2 external MSQl Switch 


CI 2 LNA 


0 


0..1 


CI2 input LNA e;iab ! 3 


CI2Q ADC 

preamplifier 


0 




1)2 Q ADC preamplifier disable 

% .. *v*V N ■ ' 


CI2I ADC 
preamplifier 






CI2^^3B~p^|^#ifier disable 

\iy O * : — 1 


field name ^ 


. ■ v 

value 


rarfe 


' intsi&retation 


CHsyn anoll 


224 * 


3 ^ 


C!^%npesi.^e% aijalog-knob settings 


C02 


0 




^Qi%|nversilll|;Co^trol 


CO 1 insert 


J 


d..it 


cB:1 invefsidn 5 cbritrol 


C12* invert 


P 


0 1 


CI2a,iMerli#n control 


©;t2b"invert',- 


b 




CI2bJptr3ion^irol 








O la inversion control 


CI1b invert' 


P 




in^^ioW^control 




T : 


p - 


Resei^ed % 


CMaic%M' 


224' 


p ^30 
...... 


CI1 ,Ag$& analog-knob settings 


"CMGWffer 




C§7 


CM Q filter adjust 


citi liit^- 


3 




^ I filter adjust 


o 




~~ 


ncocl VfcJU 


©11 vco v 


3 


0 1 


O I I t?Alt?J 1 IGU V oWlllsl 1 


CM LNA 


0 


0..1 


CI1 input LNA enable 


CI1Q ADC 
preamplifier 


0 


0..1 


CI1 Q ADC preamplifier disable 


CHI ADC 

preamplifier 


0 


0..1 


CM I ADC preamplifier disable 


CI2syn anob 


224 


0,.230 


CI2 synthesizer analog-knob settings 


refclk anob 


224 


0..230 


reference clock divider analog-knob 
settings 


CLIO anob 


224 


0..230 


CLIO analog-knob settings 






62..34 
33 
32 
31. .24 
23.. 16 
15..8 
7..0 

octlet bits 
•13 63 
62. .56 



capacitor 
calibration 


0 


0..1 


Set to enablp Pflnflritnr ralihratinn 


0 


0 


0 


Reserved 


VI invert 


D 


0..1 


VI inversion control 


VO invert 


0 


0..1 


VO inversion control 


VI anob 


224 


0..230 


VI analog-knob settinqs 


VO anob 


224 


0..230 


VO analog-knob settings 


COI anob 


224 


0..230 


C01 analoq-knob se,tti%# j 


C02 anob 


224 


0..230 


C02 analog-knQk,s%ffftgs 




%tlet 
14 



bits 
63..56 
55..48 
47.. 40 

39 
38..32 

31. .16 
15..0 



0 


0 


0 


Reserved 


EQ2 test 


0 


0..7 


EQ2 test control 


EQ1 test 


0 


0..7 


EQ1 test control 






0 


Reserved 


COI 

configuration 


0 ^ 


0..12 
7 


C01 configuration control i 




a 




eft priority? ■ 




0 




right priority? < 
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octlet bits 


field name 


value 


range 


interpretation 


15..24 63..0 


0 


0 


0 


Reserved for expansion of Cerberus 
registers upward or knobcity registers 
downward. 


octlet bits 


field name 


value range 


interpretation 


25 63-56 




224 


a.. 127 


geographical digital knob settings 


55. .48 




224 


0..127 


geographical digital knob settings 


47. .40 




224 


0..127 


geographical digital %ob> settings 


39.. 32 




224 


D..127 


geographical digitaBggpD settings 


31. .24 




224 


0..127 


geographical djgll&rlknob settings 


23.. 16 




224 


0..127 


geographica^d(^taT knob settings 


15..8 




224 


0..127 


aeographi^^gital knob settings 


7..0 




224 


0..12?^ 


geogranhJf r v knob settings 




63..S6 
55..48 
47..40 
39..32 
31. .24 
23- 16 
15.. 8 
7..0 





224 


0..127 


geographical digital knob settings 




224 


0..127 


geographical digital knob settings 




224 


0..127 


geographical digital knob settings 




224 


3.. 127 


geographical digital knob settings 




224 


0.. 127 


geographical digital knob settings 




224 


0..127 


geographical digital knob settings 




224 


0..127 


geographical digital knob settings 




224 


0..127 


geographical digital knob settings 
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octlet bits 
29 63.-56 
55.. 48 
47.. 40 
39..32 
31. .24 
23.. 16 
15..8 
7..0 

octlet bits 
29 63..56 





224 


D..127 


geographical digital knob settings 




224 


0..127 


geographical digital knob settinqs 




224 


3..127 


geographical digital knob settinqs 




224 


0..127 


geographical digital knob settinqs 




224 


0..127 


geographical digital knob settinqs 




224 


3.A27 


geographical digital knob settinqs 




224 


0..127 


geographical digital knob settinqs 




224 


0..127 


geographical digitak-k W settinqs 
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field name value range 



Hermes 
skew swing 



termination 
fine-tuning 



Voltage swing selection for Hermes 
channel skew circuits 



Global resistor mask for ail knobs. 



Reserved 



Set based on value read from PMOS 
drive strength, used tg fine-tune 
resistor values in Hj 
termination. 
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23..22 
21 



20 
19.. 16 



15.. 10 
9..0 



meltdown 
threshold 


0 

— — 


0..3 


Set to perform margin testing of the 
meltdown detector. 


conversion 
start 


0 


0..1 


Setting this bit causes the
conversion to begin. The bit remains
set until conversion is complete.


0 


0 


0 


Reserved, (selection extension) 


conversion


0*


0..9


Field selects which of ten
measurements are taken


0 


3 


0 


Reserved, (counter extension)


conversion 
counter 


0* 


0..1023


This field is set to the two's
complement of the downslope count.
The counter counts upward to zero,
and then continues counting on the
upslope until conversion completes.
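The conversion counter behavior (preset to the two's complement of the downslope count, count up to zero, then keep counting on the upslope) can be illustrated with a short simulation. This is a sketch, assuming a 10-bit counter in bits 9..0; the function names and the tick-by-tick model are hypothetical.

```python
COUNTER_BITS = 10            # assumed width of the conversion counter field
MODULUS = 1 << COUNTER_BITS


def preset_for_downslope(downslope_count: int) -> int:
    """Two's complement of the downslope count, as loaded into the counter."""
    if not 0 <= downslope_count < MODULUS:
        raise ValueError("downslope count does not fit the counter")
    return (-downslope_count) % MODULUS


def run_conversion(downslope_count: int, upslope_ticks: int) -> int:
    """Count upward through zero, then keep counting during the upslope."""
    counter = preset_for_downslope(downslope_count)
    for _ in range(downslope_count + upslope_ticks):
        counter = (counter + 1) % MODULUS
    return counter


preset = preset_for_downslope(100)
result = run_conversion(100, 37)
```

Presetting with the two's complement lets a single up-counter time the fixed downslope phase and then, starting from zero, directly accumulate the upslope measurement.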



octlet bits 
32 63 
62 




field name value range 



33-6 


J 63..0 


0 


0 


0 


Reserved for use with additional 












Hermes channel interfaces 
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REDACTED 



octlet bits 
64.. 63..0 
65536 



c 



to 



■ interpretation 



leserved for use with later revisions 
" the architecture- 



configuration memory space 



Identification Registers 

The identification registers in octlets 0..3 comply with the requirements of the
Cerberus architecture.

MicroUnity's company identifier is: 0000 0000 0000 0010 1



Internal code name 




Calliope J% 





Calliope architecture revisions i 



ig table: 



Internal ccxjpmgtei W 




1.0 i 





MicroUnity's 



MicroUnity's 
as specified " 





,v.> 

Cpde nipper 


MicroUnitV . : "•" 


OxQ0 40 a3,4&db3c 




) Micro^feity, uses implementation codes 



MicroUnity's Calliope, as implemented by MicroUnity, uses manufacturer codes
as specified by the following table:



Internal code name 


Code number 


MicroUnity 


0x00 40 a3 a4 6d ff 



MicroUnity's Calliope, as implemented by MicroUnity, and manufactured by
MicroUnity, uses manufacturer revisions as specified by the following table:



Internal code name 


Code number 


1.0 


0x01 00 ~ 1 



Architecture Description Registers 

The architecture description registers in octlets 4 and 5 comply with the Cerberus 
specification and contain a machine-readable version of the architecture 
parameters: A, W, C, AI, AO, PO, PI, VI, VO, IR, H, SO, SI, EQ, CI, CO, and 
QPSK described in this document. 

The architecture parameters describe characteristics of the Hermes interface,
capacity of the Calliope buffer memory, and the number of audio, phone, video,
infrared, smartcard, and cable input and output channels, and the number of
QPSK cable input channels.

Control Register 

The control register in octlet 6 is a 64-bit register with read and write access.
It is altered only by Cerberus accesses; Calliope does not alter the values written
to this register.

The reset bit of the control register complies with the Cerberus specification and
provides the ability to reset an individual Calliope device in a system. Writing a
one (1) to this bit is equivalent to [...] broadcast Cerberus reset
(low level on SD for 3[...]) [...] to their power-on
values, which is an [...] current (as determined
by external pins) [...] high-bandwidth logic to be reset.
The duration of [...] changes to have taken
effect. At the [...] reset/clear/selftest complete
bit of the status [...] status bit of the status
register is set, and [...]


The clear bit of the control register complies with the Cerberus specification and
provides [...] individual Calliope device in a system.
Writing [...] high-bandwidth logic to be reset, as
[...] configured [...] levels. The duration of the reset is
[...] operating state changes [...] taken effect. At the completion
of operation, the reset/clear/selftest complete bit of the status register is
set, the reset/clear/selftest status bit of the status register is set, and the
clear bit of the control register is set.

The selftest bit of the control register complies with the Cerberus specification 
and provides the ability to invoke a selftest on an individual Terpsichore device in 
a system. However, Calliope does not define a selftest mechanism at this time, so 
setting this bit will immediately set the reset/clear/selftest complete bit and the 
reset/clear/selftest status bit of the status register. 

The defer writes bit of the control register provides a mechanism to adjust several 
octlets of Cerberus registers at one time with a single transition, such as when 
setting individual power levels within Calliope. Writing a one (1) to this bit causes 
writes to octlets 10 through 32 to have no effect (to be deferred) until the next 
logic-clear or a non-deferred write. When writes have been deferred, the values 
written are lost if a read of these octlets precedes the subsequent logic-clear or 
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non-deferred write. A normal or non-deferred write occurs when writing to octlets 
10 through 32 while the defer writes bit is cleared (0). 

The module id field of the control register controls the value of the module
identifier field of the Hermes input channel which selects this Calliope device.

The Hermes channel disable bit of the control register provides the means to 
begin operations on the Hermes channels after a reset, clear, or error. Writing a 
one (1) to this bit causes the Hermes input channel to be ignored and forces idles 
to be generated on the Hermes output channel. Writing a zero (0) to this bit
causes the Hermes input channel phase adjustment to be reset, and after a
suitable delay the Hermes channels are available for use.

The cidle 0 and cidle 1 fields of the control register provide a mechanism to
repeatedly send simple patterns on the [...] for purposes of
testing and skew adjustment. For normal [...] must be set to
zero (0), and the cidle 1 field must [...]

Status Register

The status register is a 64[...] write access, though the
only legal value which [...] register. The result of
writing a non-zero value [...]

[...] complies with the [...] clear or selftest [...]

[...] complies with the Cerberus [...] completion of a reset, clear or
selftest [...]

[...] register is set when the meltdown [...] temperature above the
threshold set by the [...] configuration register, which causes a [...]
the power level to be forced to minimum (1).

The low voltage or temperature bit of the status register is set when internal 
circuits have detected either insufficient voltage or temperature for proper 
operation of high speed logic circuits, which causes a logic clear until the condition 
is no longer detected (due to an increase in supply voltage or device temperature). 

The Cerberus transaction error bit of the status register is set when a Cerberus 
transaction error (bus timeout, invalid transaction code, invalid address) has 
occurred. Note that Cerberus aborts, including locally detected parity errors, 
should cause bus retries, not a machine check. 

The Hermes check byte error bit of the status register is set when a Hermes 
check byte error has occurred. 
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The Hermes command error bit of the status register is set when a Hermes 
command error has occurred. 

The Hermes address error bit of the status register is set when a Hermes address 
error has occurred. 

The raw 0 and raw 1 fields of the status register contain the values obtained from
two adjacent samples of the specified Hermes input channel. The raw 0 field
contains a value obtained when the input clock was zero (0), and the raw 1 field
contains the value obtained on the immediately following sample, when the input
clock was one (1). Calliope ensures that reading the status register produces two
adjacent samples, regardless of the timing of the status register read operation on
Cerberus. These fields are read for purposes of testing and control of skew in the
Hermes channel interfaces.

Power and Swing Calibration


Calliope uses a set of calibration registers to control the power and voltage levels
used for internal high-bandwidth logic. The details of programming
these registers are described below.

Eight-bit fields separately control [...] used in a portion of
the Calliope circuitry. Each such field used to control digital circuitry (labeled
"knob") contains configuration data in the following format:

Each such field used to control analog circuitry (labeled "anob") contains
configuration data in the following format:

power and swing controls

The range of valid values and the interpretation of the fields is given by the 
following table: 



field   value   interpretation
0       0       Reserved
ref     0..3    Set reference voltage level.
lvl     0..3    Set voltage swing level.
res     0..7    Set resistor load value.






The reference voltage level, voltage swing level and resistor load value are model
figures for a full-swing, lowest-power logic gate output. The actual voltage levels
and resistor load values used in various circuits are geometrically related to the
values in the tables below. Designed typical, full-speed settings for the ref, lvl and
res fields are ref=250 millivolts, lvl=500 millivolts, and res=2.5 kilohms.

The ref field, together with the reference n fields of the configuration register, 
control the reference voltage level used for logic circuits in the specified knob 
domain. The value of the ref field is interpreted by the following table: 



ref   reference voltage level
0     reference 0
1     reference 1
2     reference 2
3     reference 3



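The lvl field, together with the swing n fields of the configuration register, control
the voltage swing level used for logic circuits in the specified knob domain. The
value of the lvl field is interpreted by the following table:

lvl   voltage swing level
0     swing 0
1     swing 1
2     swing 2
3     swing 3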





Values and interpretations of the reference n and swing n fields are given by the
following table.

value   reference   swing
0       [...]       275
1       [...]       300
2       [...]       325
3       175         350
4       188         375
5       200         400
6       213         425
7       225         450
8       238         475
9       250         500
10      263         525
11      275         550
12      288         575
13      300         600
14      325         650
15      350         700
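
The relationship between the two columns of the table above can be checked with a short sketch. It assumes the table is in millivolts (consistent with the typical settings ref=250 millivolts and lvl=500 millivolts quoted earlier) and that the three leading swing entries belong to values 0, 1 and 2. The names SWING_MV, REFERENCE_MV and half_swing_mv are illustrative only and do not appear in the specification.

```python
# Swing levels for reference n / swing n values 0..15, transcribed from the
# table. Assumption: units are millivolts and the first three entries
# correspond to values 0, 1 and 2.
SWING_MV = [275, 300, 325, 350, 375, 400, 425, 450,
            475, 500, 525, 550, 575, 600, 650, 700]

# Reference levels that are legible in the table (values 3..15).
REFERENCE_MV = {3: 175, 4: 188, 5: 200, 6: 213, 7: 225, 8: 238, 9: 250,
                10: 263, 11: 275, 12: 288, 13: 300, 14: 325, 15: 350}


def half_swing_mv(value: int) -> int:
    """Return half of the swing level, rounded up to a whole millivolt."""
    return (SWING_MV[value] + 1) // 2


# Collect any legible reference entry that is not half of its swing entry.
mismatches = [v for v, ref in REFERENCE_MV.items() if half_swing_mv(v) != ref]
```

Under these assumptions every legible reference entry equals half of the swing entry rounded up, and value 9 corresponds to the typical full-speed setting (250 and 500 millivolts).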
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The res field, together with the resg field of the configuration register and the 
meltdown detected bit of the status register, control the PMOS load resistance 
value used for logic circuits in the specified knob domain, referred to here as the resl
value. For each res field, the resl value is computed as:

resl = res & (meltdown detected ? 1 : resg)
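
A minimal sketch of the masking expression above, assuming res and resg are three-bit values (0..7, as in the field tables) and that "meltdown detected" is a single status bit. The function name resl and its argument names are illustrative.

```python
def resl(res: int, resg: int, meltdown_detected: bool) -> int:
    """Combine a knob's res field with the global resg mask.

    Mirrors: resl = res & (meltdown detected ? 1 : resg)
    """
    if not (0 <= res <= 7 and 0 <= resg <= 7):
        raise ValueError("res and resg are assumed to be 3-bit values (0..7)")
    # When meltdown is detected the mask collapses to 1, otherwise resg is used.
    mask = 1 if meltdown_detected else resg
    return res & mask
```

With resg set to all ones (7) each knob's own res value passes through unchanged, which matches the later statement that resg=7 enables control of individual sections; with res left at its reset value of 7, resg alone selects the level.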



The resl value, together with the process control field of the configuration register,
control the PMOS load resistance value used for logic circuits in the specified
knob domain. Values and interpretations of the resl value are given by the following
table, with units in kilohms. The table below gives resistance values with nominal
process parameters.

When the process control field of the configuration register is set equal to the
PMOS drive strength field of the configuration register, nominal PMOS load
resistance values are as given by the following table, with units in kilohms.

resl    resistance
[...]   [...]
3       4.2
4       3.1
5       2.5
6       2.1
7       1.8
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When Mnemosyne is reset, a default value of 0 is loaded into each 0 field, 0 in 
each ref field, 0 in each lvl field and 7 in each res field, which is a byte value of 
224. The process control field of the configuration register is set to 20, and the 
reference n and swing n fields are set to 15. These settings correspond to a chip 
with nominal processing parameters, nominal power and high voltage swing 
operation. 

For nominal operating conditions, the ref field is set to 0, the lvl field is set to 0, and
the res field is set to 5, which is a byte value of 5. The process control field is set
equal to the PMOS strength field, and the reference n and swing n fields are set to [...]



Interface Configuration Registers

Interface configuration registers are provided on the Calliope interface to control
the [insert summary list of controls].

The CI1 test and CI2 test fields of interface configuration register 10 control
operating modes of the CI1 and CI2 cable input blocks.

Eight-bit fields separately control the operating modes of the cable input blocks.
Each such field contains configuration data in the following format:

7 3 e', - : 2. - . Y > Q 




The CI1Q filter, CI1I filter, CI2Q filter and CI2I filter fields of interface
configuration registers 10 and 11 control the cutoff frequency of the cable input
antialias filters.

Four-bit fields separately control the cutoff frequency of each cable input antialias 
filter. Each such field contains configuration data in the following format: 



cutoff 



3 



cable input antialias filter controls 
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The range of valid values and the interpretation of the fields is given by the 
following table: 



field    value   interpretation
0        0       Reserved
cutoff   0..7    Cutoff frequency selection for antialias filter



Cable input antialias filter control field interpretation 



Values and interpretations of the cutoff fields are given by the following table,
with units in megahertz, for nominal 3 dB frequency at specified junction
temperature:




The CI1 VCO and [...] configuration registers 10 and 11
control the selection [...] to the tuner of the cable input.
Writing a [...] VCO, while writing a one (1)
selects an external [...] a zero is placed in the VCO
bit, [...]

[...] selecting a 9 MHz [...]



The [...] bits of interface configuration registers 10 and 11
[...] low noise amplifier) used as an input to the tuner of the cable
input. Writing a zero (0) to the bit disables the LNA, while writing a one (1)
enables the LNA. In normal operation a one is placed in the LNA bit, enabling the
LNA.

The CI1Q ADC preamp, CI1I ADC preamp, CI2Q ADC preamp and CI2I ADC
preamp bits of interface configuration registers 10 and 11 enable the ADC
preamplifier output used as an input to the ADC of the cable input. Writing a zero
(0) to the bit enables the ADC preamplifier output, while writing a one (1) disables 
it, allowing the tuner input to be driven from an external pin. In normal operation 
a zero is placed in the ADC preamp bits, enabling the preamplifiers. 

The CI1a, CI1b, CI2a, CI2b, CO1, CO2, VI, VO, AI, AO, PI, PO invert bits of
interface configuration registers 11, 12, and 13 provide for the selective inversion
of the relative clock phase of the analog-to-digital section internal interfaces in the
respective interfaces. In normal operation, a zero is placed in the invert bits,
matching the relative phases of the interface sections.

The CO1 configuration and CO2 configuration fields of interface configuration
registers 13 and 14 provide for the configuration of external devices which assist in
the implementation of the cable output. The configuration fields drive LVTTL
outputs which can control external filters and other components. In normal
operation, a zero is placed in the configuration fields.



The PI bias, AIR bias and AIL bias fields of interface configuration register 13
control the bias current of the phone and audio input right and left operational
amplifiers.

Four-bit fields separately control the bias current [...]
amplifier. Each such field contains configuration data [...]









125C 






^ 200 . 








133 








100 




.3 . 




80 





The mute bit of interface configuration register 13 provides for initial muting of 
the audio and phone outputs during initial system operation. Writing a zero (0) to 
the bit enables the audio and phone outputs, while writing a one (1) forces the AO 
and PO outputs to a constant value (zero with AC coupling). 

The PO filter, AOR filter and AOL filter fields of interface configuration register 
13 control the cutoff frequency of the phone and audio output right and left 
antialias filters. 
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Four-bit fields separately control the cutoff frequency of each output antialias 
filter. Each such field contains configuration data in the following format: 



cutoff 



cable input antialias filter controls 



The range of valid values and the interpretation of the fields is given by the
following table:

field    value   interpretation
cutoff   0..15   Cutoff frequency selection for antialias filter



Values and interpretations of the [...] the following table,
with units in kilohertz, for [...] at specified junction
temperature:




The EQ1 test and EQ2 test fields of interface configuration register 14 control
operating modes of the EQ1 and EQ2 cable input equalizers.
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Eight-bit fields separately control the operating modes of the cable input 
equalizers. Each such field contains configuration data in the following format: 
7..3   2       1   0
0      round   0   DSP

cable input equalizer test controls



The range of valid values and the interpretation of the fields is given by the
following table:

A configuration register [...] to control the fine-tuning of the Hermes [...]
configuration, to control the global process parameter settings, [...]
the two phase locked loop frequency generators, and to control [...]
The Hermes skew swing field of the configuration register controls the voltage
swing used in the Hermes channel skew circuits. The field should always be set
equal to the value of the lvl subfield of the Hermes channel knob field.

The resg field of the configuration register permits the global control of the load
resistors in all of Calliope's high-speed logic circuits. The resg field is initially
loaded from external pins to a nominal power level (5), and can be changed again
to a value in the range 0..7 to lower or raise the power and speed of the high-speed
logic circuits in the Calliope device, or can be set to all ones (7) to enable control of
individual sections of the Calliope device power levels. By altering the value on the
external pins, Calliope can be configured for low-power [...] testing in a
restricted packaging environment.



The termination fine-tuning field of the configuration register controls the analog
bias settings for PMOS loads in Hermes [...], in order to
accommodate variations in circuit parameters [...] manufacturing process,
and to provide intermediate termination [...]. Under normal operating
conditions, the value read from the [...] field should be written
into the termination fine-tuning [...] the field is given by the
table:

[...]   termination fine-tuning
[...]   Reserved
[...]   increase PMOS con[...]



The process control field of the configuration register controls the analog bias
settings for [...] logic circuits, in order to accommodate
variations in circuit parameters [...] the manufacturing process. Under normal
operating conditions, the [...] drive strength field should be
written into the process control [...]. The interpretation of the field is given by
the table:

value    process control
0        Reserved
1..19    increase PMOS conductance to 20/value*nominal.
20       use PMOS loads at nominal conductance.
21..31   decrease PMOS conductance to 20/value*nominal.
The PMOS drive strength field of the configuration register is a read-only field
that indicates the drive strength, or conductance gain, of PMOS devices on the
Terpsichore chip, expressed as a digital binary value. This field is used to calibrate
the power and voltage level configuration, given variations in process
characteristics of individual devices. The interpretation of the field is given by the
table:

value    PMOS drive strength
0        Reserved
1-19     value/20*nominal conductance
20       nominal conductance
21-31    value/20*nominal conductance
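
The reason for copying the PMOS drive strength reading into the process control field can be illustrated with a small sketch. It assumes the two factors simply multiply (a measured conductance of value/20 times nominal, corrected by a factor of 20/value); that multiplicative model is an assumption for illustration, not something the text states. All function names are hypothetical.

```python
from fractions import Fraction

NOMINAL = 20  # field value that denotes nominal conductance


def drive_strength_ratio(value: int) -> Fraction:
    """Measured PMOS conductance relative to nominal: value/20."""
    if not 1 <= value <= 31:
        raise ValueError("0 is reserved; legal values are 1..31")
    return Fraction(value, NOMINAL)


def process_control_ratio(value: int) -> Fraction:
    """Conductance adjustment requested by process control: 20/value."""
    if not 1 <= value <= 31:
        raise ValueError("0 is reserved; legal values are 1..31")
    return Fraction(NOMINAL, value)


def compensated_ratio(drive_strength: int, process_control: int) -> Fraction:
    """Resulting conductance relative to nominal (assumed multiplicative)."""
    return drive_strength_ratio(drive_strength) * process_control_ratio(process_control)
```

Under this model, writing the value read from the drive strength field into process control yields a ratio of exactly 1 for every legal value, i.e. nominal conductance.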




There are two identical phase locked-loop frequency generators,
designated PLL0 and PLL1. These PLLs [...] external clock
signals of configurable frequency, based upon [...] reference of either
54 MHz or 1.08 GHz. PLL0 [...] frequency of the
Terpsichore processor, [...] frequency of the
Hermes channel interfaces. [...] PLL0 and PLL1 have
identical meanings, described [...]

The PLL0 divide ratio [...] the divider ratio for
each PLL, where [...] a nominal setting of 12
for PLL0, and [...] clock signals to be
generated in [...] the input clock
reference is [...] Hz with prescaling
used.

Setting the [...] the PLL1 feedback bypass bit of the
configuration [...] clock bypass the PLL oscillator and to
operate [...] input [...]. Setting these bits causes the frequency
generated [...] reference clock. These bits are cleared
[...]



The PLL0 range field and the PLL1 range field of the configuration register are
used to select an operating range for the internal PLLs. If the PLL range is set to
zero, the PLL will operate at a low frequency (below 0.xxx GHz); if the PLL range
is set to one, the PLL will operate at a high frequency (above 0.xxx GHz). At reset
this bit is cleared, as the input clock frequency is unknown.

Setting the PLL prescaler bypass bit of the configuration register causes the
phase-locked loops PLL0 and PLL1 to use the input clock directly as a reference
clock. This bit is cleared during normal operation with a 1.08 GHz input clock, in
which the input clock is divided by 20, and is set during normal operation with a
54 MHz input clock. At reset this bit is cleared, as the input clock frequency is
unknown.

Setting the conversion prescaler bypass bit of the configuration register causes the
temperature conversion unit to use the input clock directly as a reference clock.
Clearing this bit causes the input clock to be divided by 20 before use
as a reference clock. The reference clock frequency of the temperature
conversion unit is nominally 54 MHz, and in normal operation, this bit should be
set or cleared, depending on the input clock frequency. At reset this bit is cleared,
as the input clock frequency is unknown.
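
The prescaler behaviour described for both the PLLs and the temperature conversion unit reduces to one division. The sketch below assumes only what the text states (divide by 20 unless bypassed); the function and constant names are illustrative.

```python
PRESCALE_DIVISOR = 20  # the input clock is divided by 20 when not bypassed


def reference_clock_hz(input_clock_hz: float, prescaler_bypass: bool) -> float:
    """Reference clock seen by the PLLs or the temperature conversion unit."""
    if prescaler_bypass:
        # Bypass set: the input clock is used directly.
        return input_clock_hz
    # Bypass cleared: the input clock passes through the divide-by-20 prescaler.
    return input_clock_hz / PRESCALE_DIVISOR
```

Both supported input clocks therefore arrive at the same 54 MHz reference: 1.08 GHz with the bypass bit cleared, or 54 MHz with the bypass bit set.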

The meltdown margin field controls the setting of the threshold at which
meltdown is signalled. This field is used to test the meltdown prevention logic. The
interpretation of the field is given by the table below with a tolerance of ±6
degrees C, and 5 degrees C hysteresis:

value   meltdown threshold
0       150 degrees C
1       90 degrees C
2       50 degrees C
3       20 degrees C



The conversion start bit controls the conversion of a temperature
sensor or reference to a digital value. Setting this bit causes the conversion to
begin, and the bit remains set until conversion is complete, at which time the bit is
cleared.

The conversion [...] reference value is
converted to a digital value. The interpretation of the field is given by the table
below:

[...]   conversion selected
[...]   local temperature sensor
[...]   local temperature reference
[...]   Reserved
The conversion counter field is set to the two's complement of the downslope
count. The counter counts upward to zero, at which point the upslope ramp
begins, and continues counting on the upslope until the conversion completes.
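
The counter behaviour can be modelled in a few lines. This is a sketch only: the width of the conversion counter field is not legible in this copy, so width_bits is a hypothetical parameter, and the function names are illustrative.

```python
def initial_counter(downslope_count: int, width_bits: int) -> int:
    """Two's complement of downslope_count in a width_bits-wide field."""
    modulus = 1 << width_bits
    if not 0 < downslope_count < modulus:
        raise ValueError("downslope count does not fit the assumed field width")
    return (-downslope_count) % modulus


def run_conversion(downslope_count: int, upslope_ticks: int, width_bits: int) -> int:
    """Return the counter value after the downslope and upslope phases."""
    modulus = 1 << width_bits
    counter = initial_counter(downslope_count, width_bits)
    # Downslope: count upward until the counter wraps to zero.
    while counter != 0:
        counter = (counter + 1) % modulus
    # Upslope: keep counting until the conversion completes.
    return (counter + upslope_ticks) % modulus
```

Because the counter passes through zero exactly when the downslope ends, the value left in the field at completion is the upslope duration in reference clock ticks.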

Hermes Channel Configuration Registers

Configuration registers are provided on the Calliope interface to control the 
timing, current levels, and termination resistance for the Hermes channel high- 
bandwidth channel. A configuration register at octlet 31 is dedicated to the control 
of the Hermes channel, and additional information in the configuration register at 
octlet 31 controls aspects of the Hermes channel circuits in common. The Hermes 
channel configuration registers are Cerberus registers 32, where 32 corresponds 
to Hermes channel 0. 

The quadrature bypass bit controls whether the HiC clock signal is delayed by
approximately [...] of a HiC clock cycle to latch the Hi7..0 bits. In normal, full speed
operation, this bit should be cleared to a zero value. If this bit is set, the
quadrature delay is defeated and the HiC clock signal is used directly to latch the
Hi7..0 bits.

The quadrature range bit is used to select an operating range for the quadrature
delay circuit. If the quadrature range is set to zero, the circuit will operate at a low
frequency (below 0.xxx GHz); if the quadrature range is set to one, the circuit will
operate at a high frequency (above 0.xxx GHz).

The output termination bit is used to select whether the output circuits are
resistively terminated. If the bit is set to a zero, the output has high impedance; if
the bit is set to one, the output is terminated with a resistance equal to the input
termination. At reset, this bit is set to one, terminating the output.

The termination resistance field is used to select the impedance at which the
Hermes channel inputs, and optionally the Hermes channel outputs, are
terminated. The resistance level is controlled relative to the setting of the
termination fine tuning field of the configuration register. The interpretation of
the field is given by the table, with units in ohms, and nominal PMOS
conductance and bias settings:




[...] current at which the Hermes
[...] of the field is given by the table,
[...]

value   output current
0       Reserved
1       2 mA
2       4 mA
3       6 mA
4       8 mA
5       10 mA
6       12 mA
7       14 mA



The output voltage swing is the product of the composite termination resistance,
(input termination resistance^-1 + output termination resistance^-1)^-1, and the
output current. The output voltage swing should be set at or below 700 mV, and is
normally set to the lowest value which permits a sufficiently low bit error rate,
which depends upon the noise level in the system environment.
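
A short worked sketch of the swing calculation, assuming the composite resistance is the parallel combination of the input and output terminations, that the output current is 2 mA per step of the field value (as in the output current table), and using the nominal 50 ohm termination mentioned elsewhere in this document as an example value. Function names are illustrative.

```python
from typing import Optional


def output_current_ma(field: int) -> int:
    """Output current for field values 1..7 (0 is reserved): 2 mA per step."""
    if not 1 <= field <= 7:
        raise ValueError("legal output current field values are 1..7")
    return 2 * field


def composite_resistance_ohms(r_in: float, r_out: Optional[float] = None) -> float:
    """Parallel combination of input and (optional) output termination."""
    if r_out is None:
        # Output unterminated: only the input termination loads the driver.
        return r_in
    # Product over sum, algebraically equal to (1/r_in + 1/r_out)^-1.
    return r_in * r_out / (r_in + r_out)


def output_swing_mv(field: int, r_in: float, r_out: Optional[float] = None) -> float:
    """Swing in millivolts: milliamps times ohms."""
    return output_current_ma(field) * composite_resistance_ohms(r_in, r_out)
```

With 50 ohm terminations at both ends the maximum field value gives 14 mA into 25 ohms, or 350 mV; with the output unterminated the same current into 50 ohms reaches the 700 mV limit.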

The skew fields individually control the delay between the internal Hermes 
channel output clock and each of the HoC and Ho7..0 high bandwidth output 
channel signals. Each skew field contains two three-bit values, named digital skew 
and analog skew as shown below: 

5..3           2..0
digital skew   analog skew



The digital skew fields set the number of delay stages [...] in the output path
of the HoC and the Ho7..0 high-bandwidth output [...] signals. The analog
skew fields control the power level, and thereby [...] switching delay, of a
single delay stage. Setting these fields [...] control over the
relative skew between output channel [...] the output delay
for various values of the digital [...] are given below,
assuming a nominal setting for [...]

digital skew   [...]

[...] 0 is loaded into the digital skew and 1 is
[...] a minimum output delay for the HoC [...]
Hermes High-Bandwidth Channel

MicroUnity's Hermes high-bandwidth channel architecture is designed to provide 
ultra-high bandwidth communications between devices within MicroUnity's 
Terpsichore system architecture. 

Hermes-compliant devices include one or more byte-wide input and output
channels intended to operate at rates of at least 1 GHz. These channels provide a
packet communication link to general devices, processors, memories and
input-output interfaces.



Hermes high-bandwidth channels employ nine signals, one clock signal and eight
data signals, using differential low-voltage levels for [...] communication from
one device to another. The channels are designed [...] arranged into a ring
consisting of up to four target devices [...] channels may also be
extended to permit multiple initiators [...]

The Hermes interface protocol [...] operations to a single
memory space into packets [...] address, data, and
acknowledgement. The [...] detect single-bit
transmission errors and [...]. As many as eight
operations in each device [...] as many as four Hermes
devices may be cascaded to expand [...] bandwidth.



Hermes relies [...] access to a low-level [...] adjust skew in output
channels. This [...] to adaptively adjust for
skew in the channel [...] for fixed signal skew as
may arise in [...]

The Hermes architecture defines a compatible framework for a family of
implementations with a range of capabilities. The following implementation-defined
parameters are used in the rest of the document in boldface. The values
indicated are for MicroUnity's first implementations.



Parameter   Interpretation                               Value   Range of legal values
A           log256 words in logical memory space, or     4       1 ≤ A ≤ 8
            size in bytes of a logical memory address
W           size in bytes of logical memory word         8       1 ≤ W ≤ 2^15, log2 W ∈ Z
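
The two parameters determine the size of the logical memory space. The sketch below assumes the legal ranges read as inclusive bounds (1 to 8 for A, 1 to 2^15 for W) and that the condition on W means W is a power of two; the scanned table is damaged here, so both readings are assumptions. The function names are illustrative.

```python
def check_parameters(a: int, w: int) -> None:
    """Validate A and W against the assumed legal ranges."""
    if not 1 <= a <= 8:
        raise ValueError("A is assumed to lie in 1..8")
    if not 1 <= w <= 2 ** 15:
        raise ValueError("W is assumed to lie in 1..2^15")
    if w & (w - 1) != 0:
        raise ValueError("W is assumed to be a power of two")


def logical_memory_bytes(a: int, w: int) -> int:
    """Size of the logical memory space: 256^A words of W bytes each."""
    check_parameters(a, w)
    return (256 ** a) * w
```

For the values listed for the first implementations (A=4, W=8) this gives 2^32 words of 8 bytes, or 2^35 bytes.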



Hermes devices have several optional capabilities, which are identified in the
following table: 
Capability   Meaning
Master       Capable of generating requests on output channel and
             receiving responses on input channel
Slave        Capable of receiving requests on input channel and
             generating responses on output channel
Forwarding   Capable of forwarding requests and responses from input
             channel to output channel
Cache        Capable of storing values previously read or written and
             returning these values on subsequent reads




Electrical Signalling 

Each Hermes channel consists of a one [...] and a single-phase,
constant-rate clock signal. Both the [...] differential-pair
signals. The clock signal contains alternating [...] transmitted with
the same timing as the data signals [...] frequency is one-half the
channel byte data rate.

Each channel runs at a [...] no auxiliary control,
handshaking, or flow-control [...] transmitter is responsible
for transmitting all nine [...] received with minimal
skew; the receiver is responsible [...] the presence of noise
and skew as may arise due to differ[...] environment of the clock
and of each data [...]

A Hermes device [...] responding to Hermes request packets
received on a Hermes [...] device is designated a slave device,
and must operate [...] channel at the same clock rate as the input
channel. A [...] than a specified amount of
variation in the output [...] input clock, over changes in
system [...]

A Hermes device that is capable of generating Hermes request packets is
designated a master device. A master device must be capable of generating the
constant-frequency clock signal on the Hermes output channel and accepting
signals on the Hermes input channel at the same clock frequency as is generated.
In addition, a master device must accept an arbitrary input clock phase, and must
accept a specified amount of variation in clock phase, as may arise due to changes
in system temperature or operating voltage.

Each Hermes input or output channel requires 18 pads, and the associated 
Cerberus interface requires an additional 6 pads. 
count    pad                              meaning
18       HiC, HiC_N, Hi7..0, Hi7..0_N     Hermes input channel
18       HoC, HoC_N, Ho7..0, Ho7..0_N     Hermes output channel
6        SC, SD, SN3..0                   Cerberus interface
36c+6                                     total signal pads



Each Hermes input channel is terminated at a nominal 50 ohm impedance to
ground. Each Hermes output channel is optionally terminated at the same
impedance as the device's input channel. An adjustable termination impedance,
programmable via Cerberus, is recommended.
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Hermes devices are generally designed to be placed on circuit boards face-down, 
so when viewed from the top of the circuit board, this becomes the ordering: 
The following is a diagram of the Hermes and Cerberus device interfaces, for a
device with a single pair of Hermes channel interfaces.
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Switching characteristics 




symbol   parameter                                   MIN    TYP   MAX    UNIT
tBC      HiC clock cycle time                        1000                ps
tBCH     HiC clock high time                         400                 ps
tBCL     HiC clock low time                          400                 ps
tBT      HiC clock transition time                                100    ps
tBS      set-up time, Hi7..0 valid to HiC xition     200          100    ps
[...]    hold time, HiC xition to Hi7..0 invalid     -200         -100   ps
tOS      skew between HoC and Ho7..0                 -50          50     ps



Logical Memory Structure 

Hermes defines a logical memory region as an array of [...] blocks of size W bytes.
Each access, either a read or write, references [...] single block. All
addresses are block addresses, referencing [...]
8W [...]

[...] in the logical memory
[...] space maintain consistency between the
[...] the logical memory region.

[...] control commands, most commonly
[...] with addresses and associated data. Other
[...] and responses to the above commands.

When the Hermes channel is otherwise idle, such as during initialization and 
between packets, an idle packet, consisting of a pair of an all-zero byte and all-one 
byte is transmitted through the channel. Each non-idle packet consists of two 
bytes or a multiple of two bytes and must begin with a byte of value other than all- 
zero (0). All packets begin during a clock period in which the clock signal is zero, 
and all packets end during a clock period in which the clock signal is one. 
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The general form of a packet is an array of bytes, without a specified byte 
ordering. The first byte contains a module address in the high-order two bits, a 
packet identifier, usually a command, in the next three bits, and a link 
identification number. The interpretation of the remaining bytes depends upon
the packet identifier:
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The check byte in this example is calculated as: 



• The general 




in the following table: 



Packet command interpretation 



The module address field provides for as many as four Hermes slave devices to be
operated from a single channel. Module address values are assigned either
statically, via geometric configuration pins (not recommended), or dynamically,
via a Cerberus configuration register.

The link identification field provides the opportunity for Hermes master devices to 
initiate as many as eight independent operations at any one time to each Hermes 
slave device. Each outstanding operation to a Hermes slave device must have a 
distinct link identification number, and no ordering of operations is implied by the 
value of the link-identification field. There is no requirement for link-identification 
field values to be sequentially assigned in requests or responses.
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The following section provides detailed descriptions of the structure of each type
of command packet.

Idle 

Idle packets fill the space between other packets with an alternating zero-byte and 
all-ones-byte pattern. Idle packets may be dropped when received and
regenerated between outgoing packets. The idle packet is formatted as follows: 

[Figure: idle packet]

Read Operation

Read packets cause Hermes devices to perform a read operation for the specified
address [...] data value. The value is read from cache, if one is present
and the address is present in the cache. If the address is not present in cache, the
value read is placed in the cache if the command is "read-allocate"; if the
command is "read-noallocate" the value is returned without copying
into the cache. The packet format is as follows:

    ma | com | lid
    ...
    addr 8A-1..8A-8
    check
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Read packet 



n 
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The range of valid values and the interpretation of the fields is given by the 
following table: 



field    value         interpretation
ma       0..3          Module address.
com      4,5           Packet command is "read-allocate"
                       or "read-noallocate."
lid      0..7          Respond with link identification
                       number lid.
addr     0..2^(8A)-1   Logical memory block address as
                       specified. The least significant byte
                       is sent first.
check    0..255        Check integrity of packet
                       transmission.
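
The field layout above can be sketched in code. The following is a minimal illustration, not part of the specification: the function names are hypothetical, the placement of lid in the three low-order bits of the first byte is inferred from the description of the general packet form, and the check byte is supplied by the caller because its calculation is not reproduced here.

```python
def pack_header(ma: int, com: int, lid: int) -> int:
    """Build the first packet byte: ma in bits 7..6, com in bits 5..3,
    lid in bits 2..0 (the lid position is an assumption)."""
    if not 0 <= ma <= 3:
        raise ValueError("module address must be in 0..3")
    if not 0 <= com <= 7:
        raise ValueError("packet command must be in 0..7")
    if not 0 <= lid <= 7:
        raise ValueError("link identification number must be in 0..7")
    return (ma << 6) | (com << 3) | lid


def encode_read_packet(ma: int, com: int, lid: int, addr: int,
                       addr_bytes: int, check: int) -> bytes:
    """Encode a read packet: header byte, address, check byte.

    addr_bytes plays the role of the parameter A. The address is sent
    least significant byte first, as the field table states. The check
    byte is passed in rather than computed.
    """
    if com not in (4, 5):
        raise ValueError("read packets use command 4 or 5")
    if not 0 <= addr < (1 << (8 * addr_bytes)):
        raise ValueError("address out of range for A bytes")
    if not 0 <= check <= 255:
        raise ValueError("check byte must be in 0..255")
    header = pack_header(ma, com, lid)
    return bytes([header]) + addr.to_bytes(addr_bytes, "little") + bytes([check])
```

With A = 4 the packet is six bytes long, consistent with the rule that each non-idle packet is a multiple of two bytes.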



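If the fields are valid and the specified address is within the range of the memory,
the memory is read and a read response packet is generated, containing the
requested data value. The "read-response" packet is formatted as follows: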
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The range of valid values and the interpretation of the fields is given by the 
following table: 



field    value         interpretation
ma       0..3          Module address ma as specified in
                       read packet.
com      6             Packet command is "read response."
lid      0..7          Link identification number lid as
                       specified in read packet.
data     0..2^(8W)-1   Data read from specified address.
check    0..255        Check integrity of packet
                       transmission.




In order to reduce the latency [...] may generate a
read response packet before check[...] that may alter the
contents of the response. If, [...] but before the last
byte of the read response packet [...] detects that the data was
transmitted in error, the packet is marked as invalid, by
transmitting a check byte [...] of the proper check byte.
Such a packet must be [...] may be either ignored or
suppressed by Hermes [...] information indicates a
correctable error, [...] read response packet which
contains the corrected data.
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Write Operation 

Write packets cause Hermes devices to perform a write operation, placing a data 
value into the specified address. The value is written into cache, if one is present 
and the address is present in the cache. If the address is not present in cache, and 
the command is "write-allocate", the value is written into cache. If the address is 
not present in cache, and the command is "write-noallocate", the value is written, 
leaving the cache location unchanged. The packet format is as follows: 
    7            0
    ma | com | lid
    [...]

The range of valid values and the interpretation of the fields is given by the
following table:

field    value         interpretation
ma       0..3          Module address.
com      2,3           Packet command is "write-allocate"
                       or "write-noallocate."
lid      0..7          Respond with link identification
                       number lid.
addr     0..2^(8A)-1   Logical memory block address as
                       specified. The least significant byte
                       is sent first.
data     0..2^(8W)-1   Data to be written at specified
                       address.
check    0..255        Check integrity of packet
                       transmission.



Write packet field interpretation 
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If the fields are valid and the specified address is within the range of the memory, 
the memory is written and a write response packet is generated. The "write- 
response" packet is formatted as follows: 



    ma | com | lid
    check

Write response packet

The range of valid values and the interpretation of the fields is given by the
following table:

field    value         interpretation
ma       0..3          Module address ma as specified in
                       write packet.
com      7             Packet command is "write response."
lid      0..7          Link identification number lid as
                       specified in write packet.
check    0..255        Check integrity of packet
                       transmission.



Error Handling

The receipt [...] requirements of this
specification over [...] conditions internal or
external to the [...] proper operation, such as uncorrectable
memory errors [...] which an implementation detects errors is
implementation[...] this architecture specification
[...] but this is not strictly required. All
[...] of error detection, and all detected
[...] for handling errors.

For Hermes devices, the following errors should be detected, and the level of error
detection for each of these errors is required to be documented:
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errors detected 
invalid check byte
invalid command
invalid address
uncorrectable error in cache
uncorrectable error in device
invalid identification number
internal buffer overflow
invalid module address on idle packet
invalid identification number on idle/error packet

[...]

check    [...]         Check integrity of packet
                       transmission.

Error response packet field interpretation
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Upon receipt of the error response packet, the packet originator must read the 
Cerberus status register of the reporting device to determine the precise nature of 
the error. Hermes devices reporting an invalid packet will suppress the receipt of 
additional packets until the error is cleared, by clearing the status register. 
However, such devices may continue to process packets which have already been 
received, and generate responses. Upon taking appropriate corrective actions and
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clearing the error, the packet originator should then re-send any unacknowledged
commands.

Because of the large difference in clock rate between the high-bandwidth Hermes 
channel and the Cerberus serial bus interface, it is generally safe to assume that, 
after detecting an error response packet, an attempt to read the status register via 
Cerberus will result in reading stable, quiescent error conditions and that the
queue of outstanding requests will have drained. After clearing the status register
via Cerberus, the packet originator may immediately resume sending requests to
the Hermes device.

Forwarding 

Hermes devices, whether master or slave, [...] to forward
packets which are intended for other devices [...] channel. For
slave devices, this forwarding is performed [...] of the contents of the
module address field in the packet, [...] contain a module address other
than that of the current device [...] packets which contain
such module addresses must [...] packets. For master
devices, this forwarding is [...] identifier number field in
the packet; packets which contain [...] generated by the device
are forwarded.

To minimize ring [...] these packets with
minimal latency [...] output channel
is in use, [...] is required.

The size of the [...] dependent. Avoiding the
generation of [...] buffer does not have room to hold
an additional [...] forwarding buffer is smaller than
the number [...] forwarding (generally 24 packets).
However, [...] output packets may be inhibited
[...] that require forwarding. Starvation may
[...] configuration considerations beyond the
[...]


Packets which contain a check byte error may be forwarded; however it is
recommended that such packets be transmitted with a check byte containing 
more than one error bit, to minimize the possibility of an undetected second error. 
Packets which contain a "stomped" check byte may be forwarded as is, or may be 
ignored by a forwarding device. Note that when a packet is forwarded with 
minimum latency, the output channel may begin transmitting a packet before the 
input channel has received the entire packet: in such a case, the only available 
choice is to continue forwarding the packet even if a check byte error or
"stomped" check byte is detected.
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Ring Configurations 

Hermes supports a variety of ring configurations. All devices in a cascade must 
have the same values for A and W parameters, in order that each part may 
properly interpret packet boundaries. The table below summarizes the 
characteristics of the configurations available: 




A single-master ring may contain a cascade of up to four Hermes slave devices.
A cascade of devices will have the same or greater bandwidth as a single device,
but more latency. Each Hermes slave device must be configured to a distinct
module address, and each slave device must forward packets that contain module
address fields unequal to its own.



M 



Hermes single-master ring 
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Packets are explicitly addressed to a particular Hermes device; any packet 
received on a device's input channel which specifies another module address is 
automatically passed on via its output channel. This mechanism provides for the 
serial interconnection of Hermes devices into strings, which function identically to 
a single device, except that a string has larger capacity and longer response 
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latency. Each slave device may have as many as eight transactions outstanding, 
each containing distinct id field values. 
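
The limit of eight outstanding transactions per slave, each with a distinct id value, can be tracked with a small pool. This is a sketch only: the class name is hypothetical, and handing out the lowest free number is an arbitrary choice, since the specification attaches no ordering to link identification values.

```python
class LinkIdPool:
    """Tracks the link identification numbers in use for one slave device."""

    def __init__(self) -> None:
        # All eight values (0..7) start out free.
        self._free = set(range(8))

    def acquire(self) -> int:
        """Reserve a link identification number for a new request."""
        if not self._free:
            raise RuntimeError("eight transactions are already outstanding")
        lid = min(self._free)  # any free value would do
        self._free.remove(lid)
        return lid

    def release(self, lid: int) -> None:
        """Return a number to the pool once its response has arrived."""
        if not 0 <= lid <= 7 or lid in self._free:
            raise ValueError("link identification number is not outstanding")
        self._free.add(lid)
```

A master would keep one such pool per slave module address, acquiring a value before sending a request and releasing it when the matching response is received.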

Dual-master Pair 

A dual master pair consists of two master devices and no slave devices. Each 
master device may initiate read and write operations addressed to the other, and 
each may have up to eight such transactions outstanding. No forwarding is 
required for either device as packets are sent directly to the recipient. 




Multiple-master Single-slave

A multiple-master ring may contain [...] and a single Hermes
slave device, provided that the [...] use different id values
for their requests. Each master [...] transactions. Master
devices must forward packets [...] them, as designated by
the values in the id field [...] packets, as all input
packets are designated [...]

Multiple-master Multiple-slave

A multiple-master ring may contain multiple master devices and as many as four
slave devices, provided that the master devices arrange to use different id
values for their requests. Each slave may have up to eight transactions
outstanding, and each master may use a share of those transactions. Master
devices must forward packets not specifically addressed to them, as designated by
the values in the id field. Slave devices must forward packets not specifically
addressed to them, as designated by the value of the ma field.
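
The two forwarding rules stated above can be expressed as a pair of predicates. This is an illustrative sketch: the function names are hypothetical, and the bit positions (ma in bits 7..6, id in bits 2..0 of the first packet byte) are assumed from the general packet form. Idle packets are not considered.

```python
def slave_should_forward(own_ma: int, first_byte: int) -> bool:
    """A slave forwards any packet whose module address is not its own."""
    packet_ma = (first_byte >> 6) & 0x3
    return packet_ma != own_ma


def master_should_forward(own_ids: set, first_byte: int) -> bool:
    """A master forwards any packet whose id value is not one of those it
    has arranged to use for its own requests."""
    packet_id = first_byte & 0x7
    return packet_id not in own_ids
```

For example, in a ring where one master uses id values 0..3 and another uses 4..7, a response carrying id 5 passes through the first master and is consumed by the second.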





[Figure: ring of master (M) and slave (S) devices]

Hermes multiple-master multiple-slave ring
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Response Packet Timing 

In general, a received packet which is interpreted as a command causes a 
response packet to be generated. The latency between the end of the request 
packet and the beginning of the response packet is affected by the processing and 
forwarding of other packets, by the presence or absence of the requested word in
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the cache, as well as implementation-dependent device parameters and 
characteristics. 

With full knowledge of the cache state, configurable parameters and 
implementation-dependent characteristics, a Hermes master may completely 
model the latency of responses. However, dependence on such characteristics is 
not recommended, except for testing and characterization purposes. 

A Hermes master must have the capability to detect a time-out condition, where a 
response to a request packet is never received. The length of the time-out is
implementation-defined, and dependent upon the implementation of the Hermes
slave devices, so it is recommended that this time-out be long enough to
accommodate variation in the design of Hermes slave devices, or be configurable to
permit recovery in a minimum implementation-dependent [...]

Cerberus Registers 

The Hermes channel architecture [...] Cerberus serial bus
architecture. Only the specific [...] Hermes-compliant devices are
defined below.

Hermes requires that [...] available in the high-
order byte of the [...] indicated below.

The format of [...] The octlet is the
Cerberus address [...] field in a register.
The value indicated [...] for a read/only register,
and is the value [...] a reset for a read/write
[...] If [...] or if initialization is not
required by [...] to the value field. The
range is [...] write register may be set. The
interpretation [...] utility of the register field; a
more [...]

octlet   bits     field name      value   range   interpretation
4        63..60   A               4       1..15   size of a Hermes address
         59..56   log2 W          3       0..15   size of a Hermes word
         55..0    not specified                   not specified by Hermes architecture

Architecture Description Registers 

The architecture description registers in octlets 4 and 5 comply with the Cerberus 
specification and contain a machine-readable version of the architecture 
parameters: A, W described in this document. 
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Cerberus Serial Bus 

MicroUnity's Cerberus serial bus architecture is designed to provide bootstrap 
resources, configuration and diagnostic support for MicroUnity's Terpsichore 
system architecture.

The Cerberus serial bus employs two signals, both at TTL levels, for direct
communication among as many as 2^8 devices. One signal is a continuously running
clock, and the other is an open-collector bidirectional data signal. Four additional
signals provide a geographic 8-bit address for each device. A gateway protocol and
optional configurable addressing each provide a means to extend Cerberus to as
many as 2^16 buses and 2^24 devices.

The protocol is designed for universal [...] the custom chips used to
implement the Terpsichore system [...] designed to be
compatible with implementations [...] devices, such as those made
by Xilinx, Altera, Actel and others [...] used to adapt the
Cerberus protocol in a [...] bus devices, such as
those made by Dallas Semiconductor [...] parts), ITT (IMB
bus), Signetics (I2C bus), [...] parts can be used to
adapt the Cerberus protocol [...] RS-232/422/423/485 links to
existing systems for [...] manufacturing test and
configuration, and [...]

The Cerberus [...] program load of
Terpsichore; [...] via Cerberus. Because
the Cerberus [...] the first instruction of
Terpsichore, [...] that no transactions are
required for [...] assignment.
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The SC signal is a continuously running clock signal at TTL levels. The rate is 
specified as 20 MHz maximum, 0 (DC) minimum. The SC signal is sourced from a 
single point or device, possibly through a fan-out tree, the location of which is 
unspecified. 

The actual clock rate used is a function of the length of the bus and quality of the 
noise and signal termination environment. The amount of skew in the SC signal 
between any two Cerberus devices should be limited by design to be less than the 
skew on the SD signal. 



The SD signal is a non-inverted open-collector (0 = driven [...], released =
high) bidirectional data signal, at TTL levels, used for [...] communication among
devices on Cerberus.

One of several termination networks may [...] depending upon
joint design targets of network size, [...] of the simplest
schemes employs a resistive pull-up [...] Ohms to 3.3 Volts
above Vss. A more complex [...] termination networks
including diodes, or the "Forced Perfect Termination" network proposed for the
SCSI-2 standard may be [...] configurations. Termination
voltages as high as 3.3 V [...]
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The following table specifies parameters that must be met by Cerberus-compliant 
devices. Voltages are referenced to Vss.

Recommended operating conditions               MIN    NOM    MAX    UNIT
Operating free-air temperature                   0            70    C

Electrical characteristics                     MIN    TYP    MAX    UNIT
VOL:  L-state output voltage                     0           0.5    V
VIH:  H-state input voltage, SD                2.0                  V
VIH:  H-state input voltage, SC, SN3..0 [47]   2.0                  V
VIL:  L-state input voltage                   -0.5                  V
IOL:  L-state output current [48]                           [...]   mA
IOZ:  Off-state output current [49]            -10            10    µA
COUT: Output capacitance                                    [...]   pF



[...] in Cerberus is to
ensure that [...] unique among all
devices on [...] of the device, so that
the address [...]

[...] requires at most 16 devices, the geographic addressing method
[...] assignment of addresses 0 through 15 by directly wiring the low-order
four bits of the address in binary code using input signals SN3..0. For these purposes,
wiring to a logic high (H) level supplies a value of 1, and wiring to Vss or logic low
(L) level supplies a value of 0.



47 Cerberus recommends, but does not require, that compliant devices be able to sustain input levels
provided by 5V TTL-compatible devices on the SC and SN3..0 inputs. 

48 Devices which fail to comply with the low-state output current specification may operate with 
Cerberus-compliant devices, but may require changes to the termination network. System 
designers should evaluate the effect that limited drive current will have on the worst-case Low- 
state signal level. 

49 Devices which fail to comply with the off-state output current specification may operate with 
Cerberus-compliant devices, but may limit the number of devices which may co-exist on a single 
Cerberus bus. System designers should evaluate the effect that additional leakage current will 
have on the worst-case High-state and Low-state signal levels. 
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The table below indicates the wiring pattern for each device address from 0 
through 15: 




[...] 0 through 255 [...]
Additional code[...]
[...] input signals SN3..0 as
above [...] L and H, and two additional
[...] and an inverted copy of the SC signal
[...] signals, each wired to one of four values,
[...]

This coding [...] the algorithm: If the desired device
[...], for each input signal SNx, where x is in the range 3..0, wire
[...] four signals L, H, SC, or SC_N, according to the following table,
[...] value of bit 4+x and bit x of N.



N4+x     Nx     SNx
0        0      L
0        1      H
1        0      SC
1        1      SC_N
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The table below indicates the wiring pattern for some device addresses:

Device     Binary
address    code        SN3     SN2     SN1     SN0
16         00010000    L       L       L       SC
17         00010001    L       L       L       SC_N
18         00010010    L       L       H       SC
19         00010011    L       L       H       SC_N
...
29         00011101    H       H       L       SC_N
30         00011110    H       H       H       SC
31         00011111    H       H       H       SC_N
32         00100000    L       L       SC      L
33         00100001    L       L       SC      H
34         00100010    L       L       SC_N    L
...
254        11111110    SC_N    SC_N    SC_N    SC
255        11111111    SC_N    SC_N    SC_N    SC_N
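
The wiring rule can be checked mechanically. The sketch below applies the bit 4+x / bit x table to a device address N and returns the signal each SNx input is wired to; the function and table names are hypothetical. The inverse function is a table lookup for illustration only and does not model the latch-based decode logic of the hardware.

```python
# (bit 4+x of N, bit x of N) -> signal wired to SNx
WIRING = {(0, 0): "L", (0, 1): "H", (1, 0): "SC", (1, 1): "SC_N"}


def geographic_wiring(n: int) -> list:
    """Return the wiring for [SN3, SN2, SN1, SN0] for device address n."""
    if not 0 <= n <= 255:
        raise ValueError("device address must be in 0..255")
    return [WIRING[((n >> (4 + x)) & 1, (n >> x) & 1)] for x in (3, 2, 1, 0)]


def decode_wiring(wiring: list) -> int:
    """Recover the device address from the wiring of [SN3, SN2, SN1, SN0]."""
    inverse = {signal: bits for bits, signal in WIRING.items()}
    n = 0
    for x, signal in zip((3, 2, 1, 0), wiring):
        upper, lower = inverse[signal]
        n |= (upper << (4 + x)) | (lower << x)
    return n
```

Addresses 0 through 15 have all upper bits zero, so the result uses only L and H and reduces to the plain binary wiring described for small systems.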



The diagram below shows the waveform of the SC signal and the four signals that
each of the SN3..0 inputs may be wired to.
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The values shown in the diagram above are decoded using four copies of the 
following logic, one for each value of x in the range 3..0: 



















[Figure: decode logic for input SNx; labels D, Q, G, SNx, SC]

The NU and NL values are combined together in the order [...]
to construct an [...] number by [...] are addressed.
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Bit-Level Protocol

The communication protocol rests upon a basic mechanism by which any device
may transmit one bit of information on the bus, which is received by all devices on
the bus at once. Implicit in this mechanism is the resolution of collisions between
devices which may transmit at the same time.

Each transmitted bit begins at the rising edge of the SC signal, and ends at the
next rising edge. The bit value is sampled by all devices at the next rising edge of 
the SC signal, thus permitting relatively large signal settling time on the SD signal, 
provided that skew on the SC signal is adequately controlled. 

The transmission of a zero (0) bit value on the bus is performed by the transmitter 
driving the SD signal to a logical-low value. The transmission of a one (1) bit value 
on the bus is performed by the transmitter releasing the SD signal to attain a 
logical-high value (driven by the signal termination network). If more than one 
device attempts to transmit a value on the same clock period (of the SC signal), the 
resulting value is a zero if any device transmits a zero value, and is a one if all 
devices transmit a one value. We define the occurrence of one or more devices 
transmitting a zero value on the same clock cycle where one or more devices 
transmit a one value as a collision.
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Because of this wired-and collision mechanism, if a device transmits a zero value,
it cannot determine whether any other devices are transmitting at the same time.
If a device transmits a one value, it can monitor the resulting, value on the SD 
signal to determine whether any other device is transmitting a zero value on the 
same clock cycle. In either case, if two or more devices transmit the same value on
the same clock cycle, neither device (in fact, no device on the bus) can detect the
occurrence, and we do not define such an occurrence as a collision.
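
The wired-and behaviour described above can be modelled in a few lines. This is a sketch with hypothetical function names; each list element is the bit (0 = driven, 1 = released) that one device transmits on a given clock cycle.

```python
def bus_value(transmitted: list) -> int:
    """Resulting SD value: zero if any device drives a zero, else one."""
    return 0 if 0 in transmitted else 1


def is_collision(transmitted: list) -> bool:
    """A collision is one or more zeros together with one or more ones."""
    return 0 in transmitted and 1 in transmitted


def sender_detects_collision(own_bit: int, transmitted: list) -> bool:
    """Only a device that released the bus (sent a one) can observe that
    another device drove it low; a device sending zero learns nothing."""
    return own_bit == 1 and bus_value(transmitted) == 0
```

Two devices sending identical bits produce no collision under this definition, which is why identical concurrent transactions proceed normally.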

This collision mechanism carries over to the higher levels of the protocol, where if 
two or more devices transmit the same packet or carry on the same transaction, no
collision occurs. In such cases, the protocol is designed so that the transaction
occurs normally. These transactions may occur frequently if identical devices
are reset at the same time and each initiates bus transactions, such as two
processors each fetching bootstrap code from a single shared ROM device.



Packet Protocol 

The packet protocol [...] transmit information on the
bus in units of eight [...] while resolving potential
collisions between devices [...] transmitting a packet.
The transmission [...] errors, and
for controlling the rate [...]. The protocol
also provides for the [...]

Each packet [...] which SD always has a
zero (driven) [...] of the [...] data byte are serially transmitted,
starting with [...] bit. After transmitting the eight data bits, a parity
bit is transmitted [...] with additional data, a single one
(released) [...] least-significant bit of the
next byte,
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Otherwise, on the cycle following the transmission of each parity bit, any device 
may demand an additional delay of two cycles to process the data by driving the 
SD signal (to a zero value) and then, on the next cycle releasing the SD signal (to a 
one value), making sure that the signal was not driven (to a zero value) by any 
other device. Further delays are available by repeating the pattern of driving the 
SD signal (to a zero value) for one cycle and releasing the SD signal (to a one 
value) for one cycle, and ensuring that the signal has been released. Additional 
bytes are transmitted immediately after the bus has been one (released) on the
"d" (delay) clock cycle, without additional start bits, as shown in the figure below.





Any Cerberus device may abort [...] because of a detected
parity error or a deadlock condition [...] the SD signal (to a
zero value) on the "d" (delay) [...] as well as the next ten
cycles, for a total of 12 cycles [...] ensure that the abort is
detected by all devices, [...] where a single-bit
transmission error has placed [...]. Each device that
detects an abort drive[...] ten cycles after its "a"
(abort) cycle state, [...] may have devices driving
the bus to as many as 22 consecutive [...] below shows a typical (12
cycle) transaction [...] transmission of the
transaction.

[...] may reset the Cerberus bus and all Cerberus devices, by
[...] (to a zero value) for at least 33 cycles. This is sufficient to
ensure that all devices receive the reset no matter what state the device is in prior
to the reset. Transmission may resume after the SD signal is released (to a one
value) for two cycles, as shown in the figure below.



[Figure: SC and SD waveforms; cycles labeled reset, ..., reset, start, start]



reset followed by transmission 
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The state diagram below describes this protocol in further detail: . 

r- — _ * ? 




state  out  in   next    next    action
                 (in=0)  (in=1)
s      s    is   0       s       s = 0 iff transmit first byte. Must wait in this
                                 state [...] (with s = 1) if transmitting a
                                 new transaction.
0      d0   i0   1       1       bit 0 of data. If d0 & ~i0, lose arbitration.
1      d1   i1   2       2       bit 1 of data. If d1 & ~i1, lose arbitration.
2      d2   i2   3       3       bit 2 of data. If d2 & ~i2, lose arbitration.
3      d3   i3   4       4       bit 3 of data. If d3 & ~i3, lose arbitration.
4      d4   i4   5       5       bit 4 of data. If d4 & ~i4, lose arbitration.
5      d5   i5   6       6       bit 5 of data. If d5 & ~i5, lose arbitration.
6      d6   i6   7       7       bit 6 of data. If d6 & ~i6, lose arbitration.
7      d7   i7   p       p       bit 7 of data. If d7 & ~i7, lose arbitration.
p      p    ip   d       d       p = ~^i7..0 (odd parity); abort if p ^ ip.
d      d    id   a       s/0     d = 0 iff transmit delay, abort, or reset. If
                                 id = 1, go to state 0 if not last byte of packet;
                                 else state s.
a      a    ia   g       d       a = 0 iff transmit abort or reset. If ia = 0,
                                 abort transaction.
g      0    N/A  g*      N/A     stay in state g 10 times, then go to state r.
r      r    ir   r       s       r = 0 iff transmit reset. If ir = 0 and have
                                 been in this state 12 times, reset device.
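
The per-byte sequence of states above (an optional start cycle, data bits 0 through 7, parity, delay) can be sketched as a function producing the SD value a transmitter presents on each cycle. The function names are hypothetical, and the sketch assumes no device requests a delay or abort, so the "d" cycle is released.

```python
def odd_parity(byte: int) -> int:
    """Parity bit p = NOT(XOR of the eight data bits), so that the data
    bits plus parity always contain an odd number of ones."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 else 1


def byte_cycles(byte: int, first: bool) -> list:
    """SD values for one byte: a zero start cycle if this is the first
    byte of a packet, data bits with bit 0 first, the parity bit, and a
    released (one) delay cycle."""
    data = [(byte >> i) & 1 for i in range(8)]
    start = [0] if first else []
    return start + data + [odd_parity(byte), 1]
```

A receiver can apply the same parity function to the eight bits it sampled and abort if the result differs from the received parity bit.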



In order to avoid collisions, no device is permitted to start the transmission of a 
packet unless no current transaction is underway. To resolve collisions that may
occur if two devices begin transmission on the same cycle, each transmitting
device must monitor the bus during the transmission of one (released) bits. If any
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of the bits of the byte are received as zero (driven) when transmitting a one 
(released), the device has lost arbitration, and must transmit no additional bits of 
the current byte or transaction. 
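
The arbitration rule can be simulated for one byte time. In this sketch (hypothetical function name), each entry is the byte one device attempts to transmit on the same cycle; bits are taken with bit 0 first, following the order of the state table, which is an assumption here.

```python
def arbitrate(byte_values: list) -> list:
    """Return the indices of the devices still transmitting after one
    byte. A device that sends a one (released) but sees the bus at zero
    (driven) has lost arbitration and stops transmitting."""
    active = list(range(len(byte_values)))
    for bit in range(8):
        sent = {i: (byte_values[i] >> bit) & 1 for i in active}
        bus = 0 if 0 in sent.values() else 1
        active = [i for i in active if not (sent[i] == 1 and bus == 0)]
    return active
```

Devices sending identical bytes all remain active, matching the rule that identical transmissions do not collide, and a device sending all zeros always remains active.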

A device which has lost the arbitration of a collision, or has suffered the 
occurrence of a transaction abort, may retry the transmission immediately after 
the transmission of the last byte of the current transaction, as shown in the figure 
below. 



[Figure: SC and SD waveforms; cycles labeled parity, delay, start]

re-transmission after loss of arbitration




All other devices must wait one additional [...] before transmitting
another message, as shown in the [...] all devices which
have collided perform their operations [...] of devices arbitrate
again.

All initiator-capable [...] more than 256 idle
clock cycles [...] seeing this many idle
clock cycles, [...] cycles, such devices must
abort the current [...], which consists of two
bytes of zero [...]

Slow devices [...] the transmission of packets in a
transaction [...] permitted [...] cycles. Such devices may avoid the
[...] delaying the completion of the transmission of the previous
[...] time is less than the time-out limit, as shown in the figure
[...] way, devices of any speed may be accommodated.



[Figure: SC and SD waveforms; cycles labeled parity, delay, abort, delay, abort, delay, start]

delaying completion of previous packet until ready to respond 



It is necessary that initiator-capable and other devices cooperatively avoid
collisions between the time-out packet and transaction responses. The 
responsibility of the initiator devices is to inhibit transmission of a time-out packet 
if, before the time-out packet can be transmitted, some device begins transmitting, 
even if such a transmission begins after 256 idle clock cycles have elapsed. If the 
design of a target device ensures that no more than 256 idle clock cycles elapse 
between packets of a transaction, it need not be concerned with the possibility of a
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collision during the transmission of a response packet. Otherwise, the
responsibility of the target devices is to inhibit transmission of a response if some 
other device begins transmitting a time-out packet after at least 256 idle clock 
cycles have elapsed. 

A device which requires delay after an aborted transaction or a reset may cause 
such a delay by forcing the delay bit after the first byte of the immediately 
following transaction, as required. If in such a case, the device cannot keep a copy 
of the first byte of the transaction, it may force the transaction initiator to 
retransmit the byte by aborting that following transaction after a suitable delay
has been requested. 

Transaction Protocol 

A transaction consists of the transmission [...]. The transaction
begins with a transmission by the transaction [...] specifies the target
net, device, length, type, and payload [...]. If the type of the
packet is in the range 128..255, [...] responds with an additional
packet, which contains a [...] payload. The transaction
terminates with a packet [...] type [...]..127, otherwise the
transaction continues [...] between transaction
initiator and the specified [...]

The general form [...]

The general form [...]
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The range of valid values and the interpretation of the bytes is given by the 
following table: 



Field           Value       Interpretation
[...]           0..2^16-1   network address of target, relative to
                            network address of transaction
                            initiator. Value is zero (0) if target is
                            on same bus as transaction initiator.
de              0..255      device address, in this case an
                            absolute value, i.e., not relative to
                            device address of transaction
                            initiator
L               0..255      payload length, or number of bytes
                            after transaction code (T)
T               0..255      transaction code. If the transaction
                            code is in [...], the
                            transaction continues with
                            additional packets.
P0,P1,...PL-1               Payload of transaction.


The valid transaction codes are as follows:

mnemonic   code       interpretation
te         [...]      transaction error: bus timeout,
                      invalid transaction code, invalid [...]
tc         [...]      transaction complete: normal
                      response to a write operation
d8         [...]      data returned from read octlet
           [...]      reserved for future definition
w8         [...]      write octlet
r8         [...]      read octlet
           130..255   reserved for future definition
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I 



general transaction byte interpretation 
All Cerberus devices must support the transaction codes: te, tc, d8, w8, and r8. 

All Cerberus devices monitor SD to determine when transactions begin and end. 
A transaction is terminated by the completion of the transmission of the specified 
number of payload bytes in a transaction with code in the range 128..255, or by the 
transmission of an abort sequence. For purposes of monitoring transaction 
boundaries, only the L byte is interpreted; the value of the T byte (except for the 
high order bit) must be disregarded. This is of particular importance as many 
transaction codes are reserved for future definition, and the use of such
transaction codes between devices which support them must be permitted, even
though other devices on the Cerberus bus may not be aware of the meaning of 
such transactions. A Cerberus device must permit any value in the L byte for 
transactions addressed to other devices, even if only a limited set of values is 
permitted for transactions addressed to that device. 

Transactions addressed to a device which does not provide support for the 
enclosed transaction code or payload length should be aborted by the addressed 
target device. 

The selection of the payload length L and transaction code T of the transaction
error packet is of particular note. Because the value of all information bits of the
packet is zero, it is guaranteed that a device which transmits this packet will have
collision priority over all others.

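Write Octlet

The "write octlet" transaction causes eight bytes of data to be transferred from
the transaction initiator to the addressed target device at an octlet-aligned 16-bit
device address. The transaction [...] of the form:

[Figure: write octlet packet format; not recoverable from the scan.]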



The data to be transferred to the target device is assembled into an octlet as (most 
significant byte is transmitted first): 



63   56 55   48 47   40 39   32 31   24 23   16 15    8 7     0
|  D0  |  D1  |  D2  |  D3  |  D4  |  D5  |  D6  |  D7  |



Side-effects due to the alteration of the contents of the octlet at the specified 
address are only permitted if the transaction completes normally. In the event that 
the write octlet transaction is aborted at or prior to the transmission of the A1
byte, the target device must make no permanent state changes. If the transaction
is aborted at or after the transmission of the D0 byte, the contents of the octlet at
the specified address are undefined. If alterations of the contents normally would
cause side-effects in the operation of the Cerberus device or side-effects on the 
contents of other addressable octlets in the device, these side-effects must be 
suppressed. 

If the addressed target device is not present on the Cerberus bus, the transaction 
will proceed to the point of transmitting the octlet data and then stop until the idle 
time-out limit is reached. At that point, one or more initiator-capable devices will
generate an error response packet.

If the addressed target device is present on the Cerberus bus, but the 16-bit
device address is not valid for that device, the target must generate an error
response packet.



Read Octlet

The "read octlet" transaction causes eight bytes of data to be transferred to the
transaction initiator from the addressed target device at an octlet-aligned 16-bit
device address. The transaction begins with a request packet of the form:

[Figure: read octlet request packet; not recoverable from the scan.]

The normal response to this request is of the form:

[Figure: read octlet response packet; not recoverable from the scan.]

The error respo[nse ...]

The 16-bit [...] octlet address (not a byte address)
[...] (most significant byte is transmitted first).

The data to be transferred from the target device is assembled into an octlet as
(most significant byte is transmitted first):

63   56 55   48 47   40 39   32 31   24 23   16 15    8 7     0
|  D0  |  D1  |  D2  |  D3  |  D4  |  D5  |  D6  |  D7  |
    8      8      8      8      8      8      8      8
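The octlet assembly described above (eight payload bytes, most significant byte transmitted first) can be sketched as follows. This is an illustration only: the function names `assemble_octlet` and `split_octlet` are hypothetical and do not appear in the Cerberus specification.

```python
def assemble_octlet(payload: bytes) -> int:
    """Combine eight payload bytes D0..D7 into a 64-bit octlet.

    D0 is the first byte transmitted and lands in bits 63..56.
    """
    if len(payload) != 8:
        raise ValueError("an octlet payload is exactly eight bytes")
    return int.from_bytes(payload, "big")


def split_octlet(value: int) -> bytes:
    """Inverse of assemble_octlet: produce D0..D7 in transmission order."""
    return value.to_bytes(8, "big")


if __name__ == "__main__":
    octlet = assemble_octlet(bytes([0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]))
    print(hex(octlet))
```

Big-endian conversion is used because the text states that the most significant byte is transmitted first.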



Regardless of whether the transaction completes, the read octlet transaction must
have no side-effects on the operation of the Cerberus device or the contents of
other addressable octlets.

If the addressed target device is not present on the Cerberus bus, the transaction
will proceed to the point of transmitting the octlet address and then stop until the
idle time-out limit is reached. At that point, one or more initiator-capable devices
will generate an error response packet.

If the addressed target device is present on the Cerberus bus, but the 16-bit
device address is not valid for that device, the target must generate an error
response packet.



Dedicated Octlets

Certain octlet addresses are assigned [...] by which all [...] devices may be
identified as to device type, manufacturer, revision, [...] by which devices may be
individually reset and tested. All or part of octlet [...] are reserved for
this purpose.

[Figure: map of dedicated octlets 0, 1, 2, 3, 4..5, 6, 7 and 8..2^16-1; field
contents not recoverable from the scan.]

[...] company which specifies the [...] architecture (e.g. Mnemosyne,
[...]), [...] (e.g. MicroUnity, partner), the device
manufacturing version (e.g. 1.0, 1.1, 2.0), [...] device serial number.
Addresses 0 through 2 are read/only; an attempt to write to these addresses may
cause either a normal termination or an error response. Address 3 may be
read/only or read/write.

octlet   bits 63..16 (48 bits)   bits 15..0 (16 bits)
0        architecture code       architecture revision
1        implementor code        implementor revision
2        manufacturer code       manufacturer revision
3        serial number           configurable address

The octlet at address 0 contains an architecture code and revision identifier. The
architecture code and revision identifies each distinctly designed architecture
version of a device. Normally, a change in the upper byte of the revision indicates
a change in which features may have changed. A change in the lower byte of the
revision signifies a change made to repair design defects or upward-compatible
revisions.

The architecture code is a unique 48-bit identifier, comprised of the concatenation
of a 24-bit unique company identifier 50, and a 24-bit value specified by the
designated company. This code must not duplicate 48-bit identifiers specified for
this purpose, or for other purposes, including use of unique identifiers for
implementation codes, manufacturing codes, or in IEEE [...] IEEE 802. IEEE
802 48-bit identifiers are specified in terms of a binary ordering of bits on a single
line; for Cerberus, the ordering which is appropriate is that labelled "CSMA/CD
and Token Bus," where bits are driven [...] Cerberus with the [illegible]-significant
bit of each byte first.

MicroUnity's architecture codes [...] revision [...] codes.

50 [...] IEEE. Ask for a 'unique [...]
MicroUnity's unique company identifier is: 0000 0000 0000 0010 1100 0101. Only MicroUnity
may assign unique 48-bit identifiers that begin with this value. Others may assign 48-bit
identifiers that begin with a 24-bit company identifier assigned by authority of the IEEE.

MicroUnity will, upon request, supply unique 48-bit identifiers for architectures, implementors, 
or manufacturers of designs which are fully compliant with the Cerberus Serial Bus 
Architecture. For assignment of identifiers, contact MicroUnity: 

Craig Hansen, Chief Architect 
Registration Authority for Unique Identifiers 
MicroUnity Systems Engineering, Inc. 
255 Caspian Drive 
Sunnyvale, CA 94089-1015
Tel: (408) 734-8100
Fax: (408) 734-8136



The octlet at address 1 contains an implementation code and revision identifier.
The implementation code and revision identifies each distinctly designed
engineering version of a device. The implementation code is a unique 48-bit
identifier, as for architecture codes. Normally, a change in the upper byte of the
revision indicates a change in which features may have changed, or in which all
mask layers of a device have been modified. A change in the lower byte of the
revision signifies a change made to repair design defects or in which only some
mask layers of a device have been modified.

Refer to the designated architecture specification for the values of the
implementation code and revision fields.

The octlet at address 2 contains a manufacturer code and revision identifier. The
manufacturer code and revision identifies each distinct manufacturing database
of an implementation. The manufacturer code is a unique 48-bit identifier, as for
architecture codes. Changes in the [...] may result from
modifications made to any or all [...] or improve expected
device performance.

Refer to the designated architecture specification for the values of the
manufacturer code and revision fields.

The octlet at address 3 [...] device serial number or
random number and [...] address register. If the
octlet does not [...] contain a 64-bit zero
value.

[...] number [...] must be a unique 48-bit
[...] a value chosen from a uniform [...]
The configurable address register permits a system design in which some
[...] identical Cerberus device addresses at system reset time, and
dynamically have their addresses moved to unique addresses by some Cerberus
device. The configurable address register must be set to the address designated
by the SN3..0 pins whenever the device is reset. A device which implements the
configurable address capability must also implement either a unique device serial
number or a random number, must implement the arbitration mechanism during
responses from read-octlet requests, and must ensure that all devices which are
originally set to the same address at reset time respond to a read-octlet with
identical latency. An initiator device on Cerberus may set the configurable
address register by reading the entire octlet at address 3, reading both the
serial/random number and the configurable address register. By the use of the
bitwise arbitration mechanism, only one device completes the read-octlet response
packet. Then, the initiator device writes a value to octlet address 3, where the first
48 bits of the value written must match the value just read. All target devices then
examine the first 48 bits of the value written, and only a device whose
serial/random number matches that value uses the last 16 bits 51 to load
its configurable address register. The initiator will repeat this process until
there are no more devices at the original/reset address, at which time a bus time-
out occurs on the read-octlet transaction.
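The address-assignment loop described above can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not a bus implementation: the `Device` class and the three functions are hypothetical names, a bus time-out is modelled by returning `None`, and bitwise arbitration is modelled by letting the numerically smallest serial number win (an assumption based on zero bits having collision priority and most-significant-first transmission).

```python
from dataclasses import dataclass


@dataclass
class Device:
    serial: int   # 48-bit serial or random number (first 48 bits of octlet 3)
    address: int  # 16-bit configurable address (last 16 bits of octlet 3)


def read_octlet_3(devices, address):
    """Model a read of octlet 3 from every device at `address`."""
    responders = [d for d in devices if d.address == address]
    if not responders:
        return None  # models the bus time-out that ends the process
    # Assumption: arbitration leaves exactly one responder, the lowest serial.
    winner = min(responders, key=lambda d: d.serial)
    return (winner.serial << 16) | winner.address


def write_octlet_3(devices, address, value):
    """Model a write of octlet 3; only the matching serial number reloads."""
    serial, new_address = value >> 16, value & 0xFFFF
    for d in devices:
        if d.address == address and d.serial == serial:
            d.address = new_address


def assign_addresses(devices, reset_address, first_free):
    """Repeat read/write of octlet 3 until no device answers at reset_address."""
    assigned, next_address = {}, first_free
    while (octlet := read_octlet_3(devices, reset_address)) is not None:
        if next_address == reset_address:
            next_address += 1  # never hand out the shared reset address
        serial = octlet >> 16
        write_octlet_3(devices, reset_address, (serial << 16) | next_address)
        assigned[serial] = next_address
        next_address += 1
    return assigned
```

The sketch hands out consecutive addresses purely for simplicity; the specification does not prescribe how the initiator chooses new addresses.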

The octlets at addresses 4 and 5 contain architecture parameters. Values are
device-architecture-dependent and implementor-independent; refer to the
designated architecture specification for information. Addresses 4 and 5 are
read/only; an attempt to write to these addresses may cause either a normal
termination or an error response.

octlet   63                                                      0
4        architecture parameters not specified by Cerberus
5        architecture parameters not specified by Cerberus




Octlet 6 designates overall device [...]. [...] 6 are changed only
by external devices and not by [...]; [this regis]ter is read/write. Two
bits of the first byte have standard [...] devices. Bits 61..0 are
not specified by Cerberus except [...] values are changed
only by external devices; [refer] to the designated
architecture specification [...].

Writing a one to bit [...] perform a device circuit
reset, which [...] the SD signal (to a
zero value) for [...] to an initial state in which
previous device [...] settings may be lost and
variable power settings are [...] value, after which bits 63
and 62 of the status [...].

Writing a one [...] device to perform a device logic
clear, which [...] quiescent, initial state, in which
previous device state may be [...] not affect control register settings
[...], after which bits 63 and 62 of the status register
[...].

Writing a one to bit 61, s, of octlet 6 causes the device to perform a self-test, after
which previous device state may be lost, and after which bit 62 of the status
register below is set (to one) if the self-test yields satisfactory results. Bit 63 of the
status register below is set (to one) at the end of the self-test.

octlet   63  62  61  60                                          0
6        | r | c | s | other device settings not specified by Cerberus |



51 A 16-bit field provides for the possibility of configuring devices which respond to addresses
directly that have net numbers set, thereby blurring the dividing line between Cerberus net
addresses and device addresses. Gateway designers might want to consider this possibility.

Octlet 7 designates device status. Values in address 7 are normally modified only
by the device itself, except when an external device may clear status or error
conditions; this register is read/write. However, the only valid data which can be
written to this register is a zero value, which clears any outstanding status or error
reports. Two bits of the first byte have standard meaning for all Cerberus devices.
Bits 61..0 are not specified by Cerberus except by the restriction that these values
are modified only by the device itself except for clearing by an external device;
refer to the designated architecture specification for information.

Bit 63, c, of octlet 7 indicates whether the device has [...] reset, clear, or
self-test.

Bit 62, s, of octlet 7 indicates whether the device has successfully completed reset,
clear, or self-test.

octlet   63  62  61                                              0
7        | c | s | other device [...] Cerberus |

Octlets at addresses 8..2^16-1 [...] Cerberus. Refer to the
designated architecture s[pecification ...].

Gateways

The Cerberus [...] using a gateway,
[...] signalling protocol
described above [...] bus and receives and
retransmits bus requests and [...] other gateways, thereby
reaching to additional Cerberus [...]. This document does not specify the
protocol used [...].

[Figure: Gateway Network connecting several Cerberus buses; signal labels SC,
SD and others are not recoverable from the scan.]

Each Cerberus bus in a Cerberus network may, for specification purposes, be
assigned a unique network number, in the range 0..2^16-1. These network numbers
never appear directly in Cerberus device addresses, as the target network byte
specified in the request packet of a Cerberus transaction contains only a relative
net number: the target net either minus, or xor'ed with, the initiator net. Thus, the
relative target network address is always zero when the initiator and the target are
on the same Cerberus bus, and is always non-zero when they are on different
buses.
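Both ways of forming the relative net number mentioned above (subtraction or exclusive-or) share the property that the result is zero exactly when initiator and target are on the same bus. A minimal sketch follows; the function names are hypothetical, and a 16-bit network number width is assumed from the stated range 0..2^16-1.

```python
NET_BITS = 16                  # assumed width, from the range 0..2^16-1
NET_MASK = (1 << NET_BITS) - 1


def relative_net_sub(target_net: int, initiator_net: int) -> int:
    """Relative net number formed as target minus initiator, modulo 2^16."""
    return (target_net - initiator_net) & NET_MASK


def relative_net_xor(target_net: int, initiator_net: int) -> int:
    """Relative net number formed as target xor initiator."""
    return (target_net ^ initiator_net) & NET_MASK


def recover_target_sub(relative: int, initiator_net: int) -> int:
    """Invert the subtractive form: which absolute net is being addressed."""
    return (initiator_net + relative) & NET_MASK
```

Either form lets a device ignore absolute network numbers entirely: a zero value always means "my own bus".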

A Cerberus bus permits only one transaction to occur at [...]
Cerberus network may have multiple simultaneous transactions [...]
target and initiator network addresses are all disjoint. [...]
network addresses must satisfy the relations:

[Relations not recoverable from the scan.]

A Cerberus network may set [...] for simultaneous
transactions by its internal [...] limits of performance or
bandwidth of the gateway [...] are not satisfied, one
or more transactions [...] local Cerberus bus on
which they are initiated.


Each local Cerberus [...] network by exactly one
gateway. When [...] received by a gateway on a
local Cerberus [...] number. If this byte
is non-zero, [...] gateway, must carry
this transaction [...]. The number is interpreted as a
signed byte, [...] gateway, and specifies a gateway to be the
target of the [...] designate the target gateway. We will refer
to the local Cerberus [...] gateway is attached as the initiator
bus, [...] attached as the target bus.

[...] is carried from the initiator gateway, through the gateway
[...] target gateway, which then re-transmits the packet on the target
[...]. When the request packet is re-transmitted on the target bus, the network
number byte is zero, designating a target on the target bus. The initiator gateway
may delay transmission of the request packet on the initiator bus as required to
limit or manage the flow of information through the gateway network, between
each byte of the request packet. The initiator gateway must also delay
transmission at the end of the last byte of the request packet in order to ensure
that packet aborts on the target bus are propagated back to the initiator bus. The
initiator gateway must also ensure that a target device which responds just barely
within the time-out limit on the target bus does not cause a time-out on the
initiator bus, generally by asserting a delay on the initiator bus until this condition
can be assured.

When a response packet is generated on the target bus (which may be from either
the addressed target or some time-out generator), the packet is carried in the
reverse direction by the gateway network. This response and any further packets
are carried until the end of the transaction. The contents of the response and
further packets are not changed by the gateway network.

When a local Cerberus bus reset is received by a gateway, the reset is carried by 
the gateway network and each other gateway then re-transmits a reset transaction 
on all other local buses. 

Repeater 

A Cerberus bus may be extended by inserting repeaters. A [...] electrically
separates two segments of a Cerberus bus, but provides [...] transparent linkage
between these two segments. Using a repeater is advantageous when the
capacitive load or clock skew between Cerberus devices on a large bus would
require a reduction in the clock rate. The system designer must ensure that device
addresses remain unique across what is [...]

Generally, a repeater will repeat each request packet seen on one side of the
repeater on the other side, with a delay of at least one clock cycle. If two
transactions start nearly simultaneously on each side of the repeater, the
[...] one of the transactions and permit the other to be repeated.
Arbitration must be performed fairly, such as by alternating which side of the
repeater is preferred on consecutive collisions.

A simple repeater continues until the end of the transaction by repeating the 
response packets, which may appear on the same or opposite side as the original 
request packet of the transaction. 
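The fairness rule for a repeater (alternate the preferred side on consecutive collisions) can be sketched as a small state machine. The class name, the side labels "A" and "B", and the choice of "A" as the initial preference are all assumptions made for illustration.

```python
class AlternatingArbiter:
    """Pick which side of a repeater wins when both start a transaction."""

    def __init__(self) -> None:
        self.preferred = "A"  # arbitrary initial preference (assumption)

    def resolve(self, a_requests: bool, b_requests: bool):
        """Return the winning side, or None when neither side is requesting."""
        if a_requests and b_requests:
            winner = self.preferred
            # Flip the preference only on a collision, so consecutive
            # collisions alternate between the two sides.
            self.preferred = "B" if winner == "A" else "A"
            return winner
        if a_requests:
            return "A"
        if b_requests:
            return "B"
        return None
```

Uncontested requests leave the preference unchanged, so a busy side cannot consume the other side's turn.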

If the topology of the Cerberus is constructed so that only target devices exist on 
one side of the repeater, the design may be simplified by the elimination of the 
arbitration function. In such a case, transactions may only originate from the side 
designated to contain initiator-capable devices. 

A more sophisticated repeater may "learn" which addresses are on each side of
the repeater, and only repeat transactions which need to cross the repeater to be
completed. Alternatively, a repeater may be constructed with knowledge of the
addresses to be placed on each side, such as addresses 0..127 on one side and
addresses 128..255 on the other, again permitting the selective repeating of
packets across the repeater.
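For the statically partitioned repeater described above, the forwarding decision reduces to comparing the side on which a request was seen with the side that owns the target device address. The sketch below uses the 0..127 / 128..255 split from the text; the function names and the side labels "low" and "high" are hypothetical.

```python
BOUNDARY = 128  # addresses 0..127 on one side, 128..255 on the other


def side_of(device_address: int) -> str:
    """Return which side of the repeater owns a device address."""
    if not 0 <= device_address <= 255:
        raise ValueError("device address must be in 0..255")
    return "low" if device_address < BOUNDARY else "high"


def should_repeat(seen_on: str, target_address: int) -> bool:
    """Repeat a request only if its target lives on the opposite side."""
    return side_of(target_address) != seen_on
```

A "learning" repeater would replace `side_of` with a table filled in by observing which side responses arrive from; that variant is not shown here.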

Synchronous Repeater 

A very simple form of repeater may be employed to divide up the capacitive and
leakage load on the SD signal of a Cerberus bus into two or more segments, when
a common SC clock reference is used.

[Figure: Synchronous repeater]

The synchronous repeater samples each electrically-isolated segment of a
logically-single Cerberus bus on the falling edge of each SC clock cycle, then
broadcasts the logical AND of all the values on each segment during the SC clock
low period.
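The sample-then-broadcast behaviour of the synchronous repeater amounts to a logical AND across segments, so a zero driven on any one segment is seen on all of them. A minimal sketch, with the hypothetical function name `broadcast_value` and each segment sample modelled as 0 or 1:

```python
def broadcast_value(segment_samples) -> int:
    """Return the value driven onto every segment during the SC low period.

    segment_samples holds the SD level (0 or 1) sampled on each segment at
    the falling edge of SC. The result is their logical AND.
    """
    value = 1
    for sample in segment_samples:
        if sample not in (0, 1):
            raise ValueError("samples must be 0 or 1")
        value &= sample
    return value
```

This preserves the collision behaviour of a single bus: a device driving zero on one segment still overrides devices releasing the line on the others.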




For large networks, this repeater improves performance by dividing up the RC
delay by a factor of n, though two bus settling periods now occur on each SC clock
period, so the speedup is approximately n/2.

This circuit can be economically implemented using a single TTL '621 part and a
pull-up resistor:

[Figure: synchronous repeater circuit; not recoverable from the scan.]

Icarus Interprocessor Protocol

MicroUnity's Icarus interprocessor protocol uses Hermes high-bandwidth 
channels to connect Terpsichore processors together, either directly or through 
external switching components, permitting the construction of shared-memory, 
coherently- or incoherently- cached multiprocessors. Icarus uses Hermes in the 
"Dual-Master Pair" configuration, and can be extended for use in "Multiple- 
Master Ring" configurations. 

Internal daemons within Terpsichore perform and respond to Hermes write
operations upon which the Icarus interprocessor communication protocol is
embedded. These daemons provide for the generation of memory references to
remote processors, for access to Terpsichore's local physical memory space, and
for the transport of remote references to other [...].

Interprocessor Topology

The simplest multiprocessor configuration [...] built with the Icarus
protocols is a dual-processor [...]

Dual-processor Terpsichore with Icarus interprocessor link



The diagram below represents the same dual-processor system, in a simpler 
notation: 






Dual-processor Terpsichore with Icarus interprocessor link 



In the configuration above, a pair of Hermes channels are connected together to
form an Icarus Interprocessor link in the Dual-Master Pair configuration. A
Cerberus bus connects all the system components together to facilitate system
configuration. The Terpsichore processors all run off of a common frequency
clock, as required by the Hermes channels that connect between processors.

Dual Terpsichore processors with dual Icarus links may use both links to enhance 
system bandwidth: 



Dual-processor Terpsichore with Icarus interprocessor links

A Terpsichore processor's dual Icarus links, each in Dual-Master Pair
configuration, may connect to two different processors. Using the Icarus
Transponder daemons in each processor, several processors may be
[...]

The Icarus links may also join at the ends of the linear network, forming a ring of
arbitrary size.

Four-processor [...] with Icarus interprocessor links

In the configuration above, two Icarus links are connected to each Terpsichore
processor, forming a single ring.

By connecting Icarus links into 4-master rings, providing Hermes master
forwarding for responses, using the Icarus Transponder daemons in each
processor, processors may be interconnected into a two-dimensional network of
arbitrary size:

These building blocks can then be assembled into radix-n switching networks:

By connecting Icarus links to external switching devices, multiprocessors with a
large number of processors can be constructed with an arbitrary interconnection
topology:




[...] Hermes operations to
[...] the low-level packet protocol used
[...] master device and a Hermes slave device. The packets that
[...] contain a three-bit link-action identifier, or "lid," which
[...] up to eight outstanding link-actions to be in progress at any point in time.

Link-actions consist of two actions. Each packet transmitted on the Hermes ring
corresponds to an action:

Request     the action taken by a requester to start the transaction.
Response    the action taken by the responder to finish the transaction.

Link-action nomenclature

These actions and their relation to the data flow are shown below:

[Figure not recoverable from the scan.]




Transactions [...] the upper-level packet
[...] transaction above the [...]
[...] may require that more
than eight actions [...] in order to maintain the
desired through[put ...] transaction protocol above the
link-action protocol [...] level state which must be
impleme[nted ...]

[...] that make up a transaction contain an eight-bit transaction
[...], which permit up to 256 outstanding transactions to be in
[...] point in time. These packets also contain link-action identifiers,
[...] connect these packets with others which are part of the transaction,
[...] do not contain a tid.

Transactions consist of four actions. Each action results in one or more link-level
Hermes packets, transmitted on the channels:



Request        the action taken by a requester to start the transaction
Indication     the reception of a request by a responder
Response       the action taken by the responder to finish the transaction
Confirmation   the reception of the response by the requester

These actions and their relation to the data flow are shown below:

[Figure: data flow between Requester and Responder; not recoverable from the
scan.]




The following table shows the relationship [...] transaction-level actions and
link-level actions, showing [...] link-action commands:

Transaction-level action   Link-action command
Request                    write-octlet
Indication                 write-response
[Response]                 write-octlet
Confirmation               write-response



Icarus Requester Daemon

Icarus Action Format

Request and Response [...]

A series of link-level write octlet operations comprise an Icarus request or
response action. The address of the write operation contains target routing,
transaction-id, commands and sequence information in the following format:

A remote request is a write octlet to an address of the form

[Figure not recoverable from the scan.]

with data of the form:

[Figure not recoverable from the scan.]

The tid field contains an 8-[bit ...] must be returned along
with the remote respon[se ...] unique among all
transactions originat[ing ...] transactions originating
from distinct nodes.

The com field [...] octlet, designates
the operation [...] result returned in a
response action [...], in successive octlets,
the value of the c[om ...] octlets to follow (0..9),
such that the last [...] field with a 0 value.

The node field [...] is the target of the action.

When embedded into a link-level write octlet operation, the Terpsichore
requester daemon request appears on the Hermes in the form:

[Figure: fields include tid, node 7..0, node 15..8, octlet 63..56, octlet 47...,
down to bit 0; exact layout not recoverable from the scan.]

[...] link-level write octlet
[...] greater than one octlet may
[...] the payload.



[...] consist of a series of link-level write octlet
[...] the Request and Response actions.

[...] attempts a load or store to a physical address in which the
[...] bytes are non-zero, the memory at that address is assumed to be
present in the memory space of a remote Terpsichore processor. The Icarus
Requester Daemon is an autonomous unit which attempts to satisfy such remote
memory references by communicating with an external device, either another
Terpsichore processor or a switching device which eventually reaches another
Terpsichore processor.

These remote references are characterized by an eight-byte physical byte
address, of which two bytes are used for specifying a processor node, and the
remaining six bytes are used for specifying a local physical address on that
processor node.

The Icarus Requester Daemon associates each remote memory reference with a
transaction identifier 52 of eight bits, permitting up to 256 such remote references
to be outstanding at any time; however, implementation limits within Terpsichore
may set a smaller bound.

The Icarus Requester Daemon takes the role of the Transaction Requester, and
an external device takes the role of the Transaction Responder. The daemon
generates writes to a specified byte-channel and module address, which causes an
external device to read or write remote octlets or cache lines in remote memory.
The daemon may have as many as two 53 link-level write requests outstanding at
any point in time.
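The division of an eight-byte physical address into a two-byte processor node and a six-byte local address, described above, can be sketched as follows. Which two bytes carry the node number is not legible in this copy; the sketch assumes the two high-order bytes, and the function names are hypothetical.

```python
NODE_SHIFT = 48                     # assumption: node occupies bits 63..48
LOCAL_MASK = (1 << NODE_SHIFT) - 1  # six-byte local physical address


def split_physical_address(address: int):
    """Return (node, local_address) for a 64-bit physical byte address."""
    if not 0 <= address < (1 << 64):
        raise ValueError("physical address must fit in eight bytes")
    return (address >> NODE_SHIFT) & 0xFFFF, address & LOCAL_MASK


def is_remote(address: int) -> bool:
    """Under the assumption above, a non-zero node field marks a remote reference."""
    node, _ = split_physical_address(address)
    return node != 0
```

Under this assumed layout, local references are exactly those whose node field is zero, which matches the statement that addresses with certain non-zero bytes are treated as remote.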

Terpsichore contains two such requester daemons which [...] concurrently to two
different byte-channel and/or module addresses.

Icarus Responder Daemon

The Icarus Responder Daemon [...] specified byte-channel and
module address, which enable [...] generate transaction requests
to read or write octlets o[r ...] local memory, or to
generate Terpsichore eve[nts ...] link-level writes to the
same external device [...] transaction requests
back to the external [...].

[...] concurrently to two
different byte[-channel ...]. [...] Requester, and the Icarus
[...]

[...] Transponder Daemon accepts writes from a specified Hermes
[...] address, which enable an external device to cause an Icarus
[...] to generate a request on another Hermes channel and module
[...].

Terpsichore contains two such transponder daemons which act concurrently
(back-to-back) between two different byte-channel and/or module addresses.



52 The term "sequence number" is avoided here, because the transaction-tags are not necessarily 
sequential in nature. 

53 The number of link-level requests to be outstanding is still under study.

Icarus Request

The following table summarizes the commands used for Icarus requests and
responses (response command shown in bold):

command   meaning                                               payload (octlets)
68..79    Reserved
80        add-and-swap allocate strong octlet big-endian        2
81        add-and-swap noallocate strong octlet big-endian      2
82        add-and-swap allocate weak octlet big-endian          2
83        add-and-swap noallocate weak octlet big-endian        2
84        compare-and-swap allocate strong octlet               3
85        compare-and-swap noallocate strong octlet             3
86        compare-and-swap allocate weak octlet                 3
87        compare-and-swap noallocate weak octlet               3
88        multiplex-and-swap allocate strong octlet             3
89        multiplex-and-swap noallocate strong octlet           3
90        multiplex-and-swap allocate weak octlet               3
91        multiplex-and-swap noallocate weak octlet             3
92        multiplex allocate strong octlet                      3
93        multiplex noallocate strong octlet                    3
94        multiplex allocate weak octlet                        3
95        multiplex noallocate weak octlet                      3
96..255   reserved




A remote {add,swap} [...] request is data of the form:

[Figure not recoverable from the scan.]

[A remo]te read coherent {strong,weak} cache-line request is data of the form:

63                  0
address
coherence tag

A remote write incoherent cache-line request is data of the form:

63                  0
address
bytes 0..7
bytes 8..15
bytes 16..23
bytes 24..31

A remote write {allocate,noallocate} {strong,weak} octlet request is data of the
form:

63                  0
address
bytes 0..7

A remote read {allocate,noallocate} {strong,weak} hexlet request is data of the
form:

address

A remote write {allocate,noallocate} {strong,weak} hexlet request is data of the
form:

address
bytes 0..7
bytes 8..15
Icarus Indication

An Icarus Indication consists [...] response packet for each link-
level write issued as an [...]. [...] write-response packet
contains the lid value of [...] packet; this serves both the
link-level purpose of [...] receive additional link-
level requests and [...] of the request and the
ability to receive additi[onal ...].


[...] of one or more link-level write-octlet
[...]. [...] addresses of the write operations contain
[...] contents read from memory.

[...] responses from the Terpsichore
[...] the table below:

command        meaning                                payload (octlets)
0              termination
1..9           continuation
10..22         Reserved
23             write response                         1
24..31         Reserved
32             read/add/swap octlet response          2
33             read hexlet response                   3
34             read incoherent cache-line response    9
35             read coherent cache-line response      10
[...]..255     reserved

The com field contains an 8-bit message command, as given in the table
previously.

The tid field contains the 8-bit transaction id code used in the request message.
The node field contains the 16-bit processor number used in the request message.

A remote {read,add,swap} octlet response is data of the form:

[Figure not recoverable from the scan.]

A [...] response is data of the form:

[Figure not recoverable from the scan.]

A remote read coherent cache-line response is data of the form:

[Figure not recoverable from the scan.]

[...] packet for each
[...]-level write-response
[...] packet. This serves
[...] indicating the ability to
[...]-level confirmation of
[...] to receive additional transaction-level
[...].

The Icarus Requester, Responder, and Transponder daemons must act
cooperatively to avoid deadlock that may arise due to an imbalance of requests in
a system which prevents responses from being routed to their destination.

The requirements vary depending upon the characteristics of the system 
configuration, and the mechanisms for deadlock avoidance are still under study. 

Principal mechanisms to employ are cycle-free-routing of requests, and the means 
to prioritize responses above requests in forwarding priority. 
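
The second mechanism above, forwarding responses ahead of requests, can be sketched as follows. This is only an illustration under stated assumptions: the queue structure, its capacity, and every name in it (fifo, forwarder, forward_next) are hypothetical and do not come from the Icarus definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical packet classes; the text names only requests and responses. */
enum pkt_class { PKT_REQUEST, PKT_RESPONSE };

/* A fixed-capacity FIFO of packet ids (illustrative only). */
#define QCAP 16
struct fifo { int item[QCAP]; size_t head, count; };

static int fifo_push(struct fifo *q, int id) {
    if (q->count == QCAP) return 0;        /* full: caller must hold the packet */
    q->item[(q->head + q->count) % QCAP] = id;
    q->count++;
    return 1;
}

static int fifo_pop(struct fifo *q, int *id) {
    if (q->count == 0) return 0;
    *id = q->item[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    return 1;
}

/* One forwarding node with a separate queue per packet class. */
struct forwarder { struct fifo responses, requests; };

/* Select the next packet to forward. Any waiting response goes first,
   so responses can drain even while requests are backed up. */
static int forward_next(struct forwarder *f, int *id, enum pkt_class *cls) {
    if (fifo_pop(&f->responses, id)) { *cls = PKT_RESPONSE; return 1; }
    if (fifo_pop(&f->requests, id))  { *cls = PKT_REQUEST;  return 1; }
    return 0;                              /* nothing to forward */
}
```

Keeping the two classes in separate queues is what lets a response overtake earlier requests; a single shared FIFO could not express this priority.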

Error handling 

The link-level packets contain a check byte which is designed to detect single-bit 
transmission errors in the Hermes channel. 
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When either party in an Icarus transaction receives a packet with a check error, it 
immediately shuts down input processing to avoid encountering further errors, as 
may arise from errors which disrupt the parsing of packets. It also generates an 
error packet, which ensures that the other party is notified of the error. 

The target of an Icarus transaction must maintain a copy of the link-level address 
of the most recent correctly received link-level write operation in a Cerberus 
register. Terpsichore then will clear the error using the Cerberus channel, 
resetting the Hermes input processing. Each party then re-issues any outstanding
link-level transactions. 



The contents of the address field in the link-level protocol are used to ensure that
the error handling mechanism does not result in repeated operations.
This is important, because unlike the link-level protocol, the transaction-level
protocol contains non-idempotent operations.
L.64            r3,-1(r8)
G.MULADD.8      r4,r3,k10,r4
L.64            r3,0(r8)
G.MULADD.8      r4,r3,k11,r4
L.64            r3,1(r8)
G.MULADD.8      r4,r3,k12,r4
A.ADD           r2,r8,row
L.64            r3,-1(r2)
G.MULADD.8      r4,r3,k20,r4
L.64            r3,0(r2)
G.MULADD.8      r4,r3,k21,r4
L.64            r3,1(r2)
G.MULADD.8      r4,r3,k22,r4
G.COMPRESS.16   r4,r4,8
S.64            r4,0(r9)
A.ADD           r8,8
A.ADD           r9,8
B.NE            r8,r10,1b

With some obvious reordering of tMnN^lresj 
run in 10 cycles, assuming smgle^cfsf%lat| 
can be used to handle greater L 
or 0.8 pixels/cycle. Counting %j3%multi 
add as 16 operations, 
cycles/loop = 13.6 c 



istructions, this can 
>D. Loop unrolling 
m cycles per eight pixels, 
nd each multiply and 
§perations/loop / 10 




Note that our desij 
use of "excess" 



J int8 *lst, ii 
rk01, inti 
int8 ^0^nt8 k11, int8 kT2, 
int8k2^int8k21, int8 k22) { 
%( {i=0; i!=4*pcount; i++) { 
dst[i] ' 



which is making good 



ide%p of pixels each 32 bits in 
Ij^ha, We treat each component 
offsets change slightly. A C 



(src[i-row-4]*k00 + src[i-row]*k01 + src[i-row+4]*k02 + 
src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 + 
src[i+row-4]*k20 + src[i+row]*k21 + src[i+row+4]*k22)>>8;



The assembler coding of the inner loop is: 



A.SUB           r2,r8,row
L.64            r3,-4(r2)
G.MUL.8         r4,r3,k00
L.64            r3,0(r2)
G.MULADD.8      r4,r3,k01,r4
L.64            r3,4(r2)
G.MULADD.8      r4,r3,k02,r4
L.64            r3,-4(r8)
G.MULADD.8      r4,r3,k10,r4
L.64            r3,0(r8)
G.MULADD.8      r4,r3,k11,r4
L.64            r3,4(r8)
G.MULADD.8      r4,r3,k12,r4
A.ADD           r2,r8,row
L.64            r3,-4(r2)
G.MULADD.8      r4,r3,k20,r4
L.64            r3,0(r2)
G.MULADD.8      r4,r3,k21,r4
L.64            r3,4(r2)
G.MULADD.8      r4,r3,k22,r4
G.COMPRESS.128  r4,r4,8
S.64            r4,0(r9)
A.ADD           r8,8
A.ADD           r9,8
B.NE            r8,r10,1b



This uses the same algorithm as for 
performed at the same rate, but sin; 
rate is four times slower. The inn< 
pixels/cycle. 



Conversion of M 



To convert a monot 
monochrome pixel 
be set to a cons) * 

void Monochi 
int i; 

for (i=0; \b 
dst[i], 

dsJ{4*j+%= 
ds{{4l+2] 

JC 





^result! 
tead omitted 



the following inner loop (addressing operations and loop 
they do not influence the operation count): 



L.64.B          r4,0(r8)
G.SHUFFLE.16    r2,r4,r4
G.SHUFFLE.16    r8,r4,r5        #r5 contains -1
G.SHUFFLE.8     r6,r2,r8
G.SHUFFLE.8     r8,r3,r9
S.128.B         r6,0(r9)
S.128.B         r8,16(r9)
A.ADD           r8,8
A.ADD           r9,32
B.NE            r8,r10,1b
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The above sequence is 4 cycles per 8 pixels, or 2.0 pixels/cycle. 

void MonochromeWithAlphaToColor(int8 *src, int8 *alpha, int8 *dst, int pcount) {
    int i;

    for (i=0; i!=pcount; i++) {
        dst[4*i] = src[i];
        dst[4*i+1] = src[i];
        dst[4*i+2] = src[i];
        dst[4*i+3] = alpha[i];
    }
}

Which results in the following inner loop: 



L.64.B          r4,0(r8)
L.64.B          r5,0(r9)
G.SHUFFLE.16    r2,r4,r4
G.SHUFFLE.16    r8,r4,r5
G.SHUFFLE.8     r6,r2,r8
G.SHUFFLE.8     r8,r3,r9
S.128.B         r6,0(r10)
S.128.B         r8,16(r10)
A.ADD           r8,8
A.ADD           r9,8
A.ADD           r10,32
B.NE            r8,r11,1b



The above sequence is 4 eye 

Conversion of Ook 




To convert a coloi 
green and blue 
selected so tlis 
weighted si 
overflow. 



lighted sum of the red, 
.ts^kO, kl, and k2, are 
>^scur. The resulting 
* * id the possibility of 



int8 k1, int8 k2) { 



Results in the following inner loop:



L128.B 


r2,0(r8) 


G.DEAL.16 


r2,r2,r3 


L.128.B 


r4,16(r8) 


G.DEAL.16- 


r6,r4,r5 


G.DEAL.8 


r2,r2,r6 


G.DEAL8 


r4/3,r7 


G.MUL.8 


r6,r2,k0 


G.MULADD.8 


re.ra.kl.rG 


G.MULADD.8 


r6,r4,k2,r6 


G.COMPRESS.16 


r6,r6,8 


S.64 


r6,0(r9) 


A.ADD 


r8,32 


A.ADD 


r9,8 


B.NE 


r8,r10,1b 



>i+2)*k2)»8; 



#k0k1...k0k1k200...k200 



#k0k1...k0k1k200...k200 
#k0k0...k0k0k1k1...k1k1 
#k2k2...k2k20000...0000 



#toss away low precision 
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The code below performs the same action, but also saves the alpha value into a 
second destination array. 

void ColorToMonochrome(int8 *src, int8 *dst, int8 *alpha, int pcount,
        int8 k0, int8 k1, int8 k2) {
    int i;

    for (i=0; i!=pcount; i++) {
        dst[i] = (src[4*i]*k0 + src[4*i+1]*k1 + src[4*i+2]*k2)>>8;
        alpha[i] = src[4*i+3];
    }
}

Which results in the following inner loop: 



L.128 
G. DEAL. 16 
L.128 
G. DEAL. 16 
G.DEAL.8 
G.DEAL8 
S.64 

G.MUL.8 
G.MULADD.8 
G.MULADD.8 
G.COMPRESS.m %6B,8 
S.64 
A. 




r2,0(r8) 
r2,r2,r3 
r4,16(r8) 
r6,r4,r5 
r2,r2,r6 
r4,r3.r7 % ^ 
r5,0(r1Q) 
r6,r2,kO 



,64 rB,0(r9) ' - 

'•ADD r8,32 



A.ADD r^JS J^£^ 

The above sequence sZ cycler tmd wrii 



A 



,ni'€f "V' 




iels/cycle. 
Image Warping

Image warping is the general process of selectively stretching and shrinking an
image to make it appear to fit into a new shape, such as stretched around a
sphere, or drawn on a surface that is tilted with respect to the viewing surface. A
principal data structure used to generate such an effect is a set of decimated
copies of the image, as shown in the diagram below. These are of particular value
because interpolation of the elements of these copies produces a properly
antialiased spatially-warped image. Note that the total size of this structure is
always exactly four times larger than the original image. Each subarray is a copy of
the image decimated in either the x or y direction, or both. The images get smaller
and smaller going right and down in the array, until the image reaches a single dot.
The original image need not be square or have sizes that are powers of two for this
structure.
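
The factor of four can be checked with a short calculation. The sketch below is an illustration, not part of the original design: it assumes each decimation step halves a dimension and rounds up, and the function names are hypothetical. Since the copy at position (i,j) in the array measures (width after i halvings) by (height after j halvings), the total pixel count is the product of the two halving sums.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of n + ceil(n/2) + ceil(n/4) + ... down to 1. */
static uint64_t halving_sum(uint64_t n) {
    uint64_t sum = 0;
    assert(n > 0);
    for (;;) {
        sum += n;
        if (n == 1) break;
        n = (n + 1) / 2;          /* assumed rounding: round up */
    }
    return sum;
}

/* Total pixels in the array of decimated copies, original included. */
static uint64_t decimated_array_pixels(uint64_t width, uint64_t height) {
    return halving_sum(width) * halving_sum(height);
}
```

For a 1024 by 1024 image each halving sum is 2047, so the structure holds 2047*2047 pixels, just under four times the 1024*1024 original. With sizes that are not powers of two, the rounding assumption decides whether the total lands slightly above or below that figure.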

[Figure: the original image (1:1) packed together with its decimated copies, labeled
1:2, 1:4, 2:1, 2:2, 2:4, 2:8, 4:1, 4:2, 4:4, 4:8, 8:1, 8:2, and so on.]

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Image subarray packing for image warping 

In the sections below, we explore two parts of the problem: the creation of the
array containing this decimated image, and the antialiased selection of items in the
table. These are the parts of the process which must be performed in real time for
real-time application of this process; the creation of the warping maps can often
be precomputed, and is a function of the rendering system used.

Decimation of Monochrome Image 

The process of generating the decimated images above can be divided into two
parts, decimating in the horizontal direction only, and decimating in the vertical
direction only. The former generates all the blocks to the right of the original
image, and the latter generates the remaining blocks from those in the top row.
This divides the problem into two parts, each using one-dimensional filtering,
which is a great advantage because the amount of computation grows only linearly
with the size of the filter function, rather than quadratically, as when using
two-dimensional filtering.
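
The saving from one-dimensional filtering can be illustrated with a small scalar sketch. It assumes a filter whose two-dimensional kernel is the product of a horizontal and a vertical one-dimensional kernel, and it omits the truncating shift used elsewhere in this section so that the two results can be compared exactly. The image size, kernel width, and function names are hypothetical.

```c
#include <assert.h>

enum { W = 6, H = 6, K = 3 };   /* illustrative sizes, not from the text */

/* Direct 2-D filter at (x,y) with kernel kv[j]*kh[i]: K*K terms,
   assuming the kernel products would be precomputed. */
static long filter_2d(int img[H][W], int x, int y,
                      const int kh[K], const int kv[K]) {
    long sum = 0;
    for (int j = 0; j < K; j++)
        for (int i = 0; i < K; i++)
            sum += (long)img[y + j - 1][x + i - 1] * kv[j] * kh[i];
    return sum;
}

/* Horizontal 1-D pass at (x,y): K multiplies. */
static long filter_h(int img[H][W], int x, int y, const int kh[K]) {
    long sum = 0;
    for (int i = 0; i < K; i++)
        sum += (long)img[y][x + i - 1] * kh[i];
    return sum;
}

/* Vertical pass over horizontally filtered values. When the horizontal
   results are stored and shared between rows, each output pixel costs
   K + K multiplies rather than K * K. */
static long filter_separable(int img[H][W], int x, int y,
                             const int kh[K], const int kv[K]) {
    long sum = 0;
    for (int j = 0; j < K; j++)
        sum += filter_h(img, x, y + j - 1, kh) * kv[j];
    return sum;
}
```

For a 5-point filter this is 10 multiplies per pixel instead of 25, which is the linear versus quadratic growth described above.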

a 5 -point filter, 
jhts, k0..k4, are 
? occur. The resulting 
avoid the possibility of 



Our first example is the one-dimensionafe^rizont^ 
specified by coefficients k0..k4 to sonify ih& 
selected so that k0+kl+k2+k3+k4 = " * & 
weighted sum is truncated, rathej 
overflow. 



void HorizontalDecimationMoi 

int8 kO, int8 k1 , in{g|k2, 
inti.j.k; 




int drow, int pcount, 



Which rei 

1- rbi 



; G Dl 

g.mC 

" G.MULADD.8 
L.128 
G.DEAL.8. 
G.MULADD.8 
G.MULADD.8 
L.128 
G.DEAL.8 
G.MULADD. 
G.COMPRESS.16 
S.64 
A.ADD 

A. ADD 

B. NE 
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This inner loop is 9 cycles per 8 pixels, or 0.9 Gpixels/sec, when the filter kernel 
size is 5 pixels wide. (For 3 pixels wide, the rate is 6 cycles per 8 pixels, or 1.3 
pixels/cycle.) 
When decimating in the vertical direction, the rate is even higher still:

void VerticalDecimationMonochrome(int8 *src, int8 *dst, int srow, int drow, int pcount,
        int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    int i,j,k;

    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++) {
            dst[k++] = (src[i-2*srow]*k0 + src[i-srow]*k1 + src[i]*k2 +
                        src[i+srow]*k3 + src[i+2*srow]*k4)>>8;
            i++;
        }
        i += srow+srow-drow;
    }
}



Which results in the following inner loop: 




:. (For 3 pixels wide, the rate is 

To generate the decimated array shown above, for an n^2 image, n^2 pixels are
generated in the horizontal direction, and 2n^2 pixels are generated in the vertical
direction. Using 5 pixel filter functions, this takes: n^2/0.9 + 2n^2/1.3 =
n^2*(1/0.9+2/1.3) = 2.63*n^2 cycles. Thus, a 1024^2 image can be decimated in 2.8
Mcycles.
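
The estimate above can be reproduced with a few lines of C. This sketch assumes the quoted rates are rounded forms of 8 pixels per 9 cycles (horizontal) and 8 pixels per 6 cycles (vertical); the horizontal figure is stated in the text, the vertical figure is inferred from the 1.3 pixels/cycle rate, and the function name is hypothetical.

```c
#include <assert.h>

/* Estimated cycles to build the decimated array for an n-by-n image,
   given the cycles needed per group of 8 output pixels in each pass. */
static double decimation_cycles(double n,
                                double h_cycles_per_8, double v_cycles_per_8) {
    double horizontal_pixels = n * n;         /* blocks right of the original */
    double vertical_pixels = 2.0 * n * n;     /* all remaining blocks */
    return horizontal_pixels * h_cycles_per_8 / 8.0
         + vertical_pixels * v_cycles_per_8 / 8.0;
}
```

With these assumed rates the factor is 9/8 + 2*6/8 = 2.625 cycles per source pixel, which rounds to the 2.63 quoted, and a 1024 by 1024 image takes about 2.75 million cycles.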

It is also possible to simultaneously decimate in the vertical and horizontal
direction. While this may be more expensive than separately decimating in each
direction, it permits the use of filter functions which do not factor into two parts.
For this example, we assume a 2:1 decimation rate in each direction, and a 3x3
filter kernel. Real applications of decimation may use larger filter kernels, but this
size serves to illustrate the techniques used. We assume here that pcount is a
multiple of drow, and that drow<srow/2.

void DecimateMonochrome(int8 *src, int8 *dst, int srow, int drow, int pcount,
        int8 k00, int8 k01, int8 k02,
        int8 k10, int8 k11, int8 k12,
        int8 k20, int8 k21, int8 k22) {
    int i,j,k;

    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++) {
            dst[k++] = (src[i-srow-1]*k00 + src[i-srow]*k01 + src[i-srow+1]*k02 +
                        src[i-1]*k10 + src[i]*k11 + src[i+1]*k12 +
                        src[i+srow-1]*k20 + src[i+srow]*k21 + src[i+srow+1]*k22)>>8;
            i+=2;
        }
        i+=2*(srow-drow);
    }
}

Assembler code for inner loop: 



r2,r8,srow 
r6,-1(r2) 
r6,r6.r7 
r4,r6,k< 
r4,r7, 




A.SUB 
L.128 
G.DEAL.8 
G.MUL.8 
G.MULADD.8 
L128 
G.DEAL.8 
G.MULADD.8 

G.MULADQ.6 r4,r6,!< . 
C.MULADD 3 f ,4>7,k11,r4 

L.128 ^ . . rs.HrS) 

C DEAll-8 i6,re.r7 

G Mi ft ADD <3 f4,r6,k12,r4 

A.ADD r2,F8,srow 

L.128 r6,-Kr2) - 

G DEAL 8 r6.rb.rv 

G MliLApD 8 r4 re.kQQ, 4 

G.MULADD8 r4,r7,k0l,r4 
%MiP _ '% 
G DEALjT\ 
G MULADD a 
G COMPRESS 16 
S64~%f 
9 A.ADD 

A. ADD 

B. NE 







After some reordering of the address calculation instructions, the inner loop is 16 
cycles per 8 pixels, or 0.5 pixels/cycle. Note that for 2:1 decimation in each 
direction, this is 4 times larger when expressed in terms of the input pixel rate: 2.0 
pixels/cycle. 

Because the filter function is an odd-number of pixels wide, 1/4 of the multiply 
bandwidth is effectively unused. For a 5x5 filter function, this would drop to 1/6 
unused, and for an even number of pixels wide, none would be wasted. Compared 
to the two-dimensional filtering case, the multiplier bandwidth is less utilized 
because the index multiplier required the additional DEAL operations to be 
added. 
Decimation of Color Image

Our first example is the one-dimensional horizontal filter. We use a 5-point filter,
specified by coefficients k0..k4. These weights, k0..k4, are
selected so that k0+k1+k2+k3+k4 = 256, so overflow does not occur. The resulting
weighted sum is truncated, rather than rounded, again, to avoid the possibility of
overflow.
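
The overflow argument can be seen in a scalar sketch of one filtered sample. The code below is an illustration only: it assumes unsigned 8-bit samples and weights, and the helper name filter5 and its step parameter (4 for interleaved color components, 1 for monochrome) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Weighted sum of five samples spaced `step` bytes apart, centered on p.
   The weights are assumed unsigned and to sum to 256; the result is
   truncated rather than rounded. */
static uint8_t filter5(const uint8_t *p, int step, const uint8_t k[5]) {
    unsigned sum = 0;
    for (int t = 0; t < 5; t++)
        sum += (unsigned)p[(t - 2) * step] * k[t];
    /* sum <= 255 * 256, so sum >> 8 always fits in 8 bits. */
    return (uint8_t)(sum >> 8);
}
```

Because the weights sum to 256, a flat region of the image passes through unchanged, and the largest possible sum, 255*256, still shifts down to 255. Rounding would add up to 128 before the shift and could carry the result past 255, which is why truncation is used.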

void HorizontalDecimationColor(int8 *src, int 8 *dst, int srow, int draw, int pcount 
int8 kO, intS k1, int8 k2, inta k3, int8 k4) { 
int i,j,k; 

for (k=0,i=0; k!=pcount; ) { 
for (j=0; j!=drow; j++) { 

dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k 
src[i+4]*k3 + src[i+8]&4 

dst{k++] = (src[i-8]*k0 + s^U]*\<r 
src[l+4]*k3 + J%fl1l|*k4< 

[k++] = (mNnR! 1 ^ 




► G.MC31 
L.128 
G.DEAL.32 
G.MULADD.8 
G.COMPRESS.16 
S.64 
A.ADD 

A. ADD 

B. NE 



!,r4 

r4,r7,k3,r4 
r6,8(r8) 
r6,r6,r7 
r4,r6,k4,r4 
r4,r4,8 
r4,0(r9) 
r8,16 
r9,8 

r8,r10,1b 






This inner loop is 9 cycles per 2 pixels, or 0.2 pixels/cycle, when the filter kernel
size is 5 pixels wide. (For 3 pixels wide, the rate is 6 cycles per 2 pixels, or 0.3 
pixels/cycle.) 

When decimating in the vertical direction, the rate is even higher still: 

void VerticalDecimationColor(int8 *src, int8 *dst, int srow, int drow, int pcount,
        int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    int i,j,k;

    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=4*drow; j++) {
            dst[k++] = (src[i-8*srow]*k0 + src[i-4*srow]*k1 + src[i]*k2 +
                        src[i+4*srow]*k3 + src[i+8*srow]*k4)>>8;
            i++;
        }
        i+=4*(srow+srow-drow);
    }
}



Which results in the following inner loop: 




A.SUB 
L.64 

G.MUL.8 

A.SUB 

L64 

G.MULADD.8 
L.( 

G.MULADD.8 

A.ADD 

L.64 

G.MULADD.8 

A.ADD 

L.64 

G.MULADI 
G.COMP8I 



"pixels wide, the rate is 

>r a n 2 image, n 2 pixels are 
els are generated in the vertical 
this takes: n 2 /0.2 + 2n 2 /0.3 = 
a 10242 image can be decimated in 11 



The last example in this section decimates a color signal in both directions
simultaneously. We assume a 2:1 decimation rate in each direction, and a 3x3 filter
kernel. Real applications of decimation may use larger filter kernels, but this size
serves to illustrate the techniques used. We assume here that pcount is a multiple
of drow, and that drow<srow/2.

void DecimateColor(int8 *src, int 8 *dst, int srow, int drow, int pcount, 
int8 kOO, int8k01, intS k02, 



To general 
generated^ — 
direction. "Usin^ 7^ pixel? 
n 2 *(l/&2*2/03) « I0 5'n< 
M. 




int8 k10, int8k11, int8 k12, 
int8 k20, int8 k21, int8 k22) { 
int i,j,k; 






for {k=0,i=0; k!=4*pcount; ) { 
for {j=0; j!=drow; j++) { 

dst[k++j=(src[i-4*srow-4]*k00 + src[i-4*srow]*k01 • 
src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 + 
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"6w+4]*k22)»8; 



src[i+4*srqw-4]*k20 + src[i+4*srowJ*k21 + src[i+4*srow+4]*k22)»8; 

- " 

dst[k++]=<src[i-4*srow-4]*k00 + srcfi-4*srow]*k01 + src[i-4*srow+4l*k02 + 
src[i-4]*k10 + src[i]*k11 + src{i+4]*k12 + 

src[i+4*srow-4]*k20 + src[i+4*srow]*k21 + src[i+4*srow+4]*k22)»8; 
i++; ■ ' 

dst[k++Hsrc[i-4*srow-4]*k00 + src[i-4*srow]*k01 + srcfi-4*srow+4]*k02 + 
src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 + 

src[i+4*srow-4]*k20 + src[i+4*srow]*k21 + src[i+4*srow+4]*k22)»8- 

dst[k++]=(src[i-4*srow-4]*k00 + src[i-4*srow]*k01 + src^*sfow+4]*k02 + 
. src[M]*k10 + src[i]*k11 + src[i+4]*k12 + -Sv^'V 
src{i+4*srow-4]*k20 + src[i+4*srow]*k21 + ! 
i+=5; 

i+=4*(srow+srow-drow-drow); 



Assembler code for inner loop: 

A. SUB 
L.128 
G.DEAL.32 
G.MUL.8 
G.MULADD.8 
L.128 
G.DEAL.32 
G.MULAI 
L128 




G.DEAL.32 
G.MULADD.8 
G.COMPRESS.16 
S.64 
A.ADD 

A. ADD 

B. NE 



r6,4(r2) 
r6,r6 

r4,r6,k02,r4 
r4,r4,8 
r4,0<r9) 
r8,16 
r9,8 

r8,r10,1b 



After some reordering of the address calculation instructions, the inner loop is 16 
cycles per 2 pixels, or 0.12 pixels/cycle. 

Fractional Interpolation 



This section is under construction. 
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Image Compression Applications 

The following examples demonstrate key portions of JPEG and MPEG image 
compression applications. Both JPEG and MPEG applications rely on the use of a 
2-dimensional Discrete Cosine Transform (DCT) to transform raster-image data 
into a frequency-based representation that is more amenable to entropy coding. 

The following examples demonstrate several applications, listed below in summary
form with the performance estimated. The estimates assume single-cycle loads
and stores, that is, they do not account for losses due to cache misses. However,
the memory reference patterns are very uniform, and with^^^^ning, they could 
be kept invisible. 




jecl oh an 8-by-8 match of data by doing a 
l of the 8 rows of" the matrix, and on each of 
^eanllto imp%rnent these operations is to 
f of theltoatrix, transpose the matrix, then 
ospcApe matrix again. 



uctions i 
; matrix are shuffled Iog2N times. 



iume the matrix originally is in the order: 



0 1 2 3 4 5 6 7 

8 9 10 11 12 13 14 15 

16 17 18 19 20 21 22 23 

24 25 26 27 28 29 30 31 

32 33 34 35 36 37 38 39 

40 41 42 43 44 45 46 47 

48 49 50 51 52 53 54 55 

56 57 58 59 60 61 62 63 J 



using 
first and 






54 Stone, Harold, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on 
Computers, Vol C20, No. 2, February 1971, 153 
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After one shuffling, the matrix is in the order: 

0 32 1 33 2 34 3 35
4 36 5 37 6 38 7 39
8 40 9 41 10 42 11 43
12 44 13 45 14 46 15 47
16 48 17 49 18 50 19 51
20 52 21 53 22 54 23 55
24 56 25 57 26 58 27 59
28 60 29 61 30 62 31 63

After a second shuffling, the matrix is in the order: 



After a third shufBmg,^^^px^ 




} 



for (i=0; i<32; i++) { tm0[2*i] = src[i]; tm0[2*i+1] = src[i+32]; }
for (i=0; i<32; i++) { tm1[2*i] = tm0[i]; tm1[2*i+1] = tm0[i+32]; }
for (i=0; i<32; i++) { dst[2*i] = tm1[i]; dst[2*i+1] = tm1[i+32]; }
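
The three loops above can be wrapped into a self-contained routine to confirm that three perfect shuffles do transpose an 8-by-8 matrix. This is a scalar sketch under stated assumptions: it uses int elements for simplicity, and the helper name shuffle64 is hypothetical. Each shuffle rotates the 6-bit element index left by one bit, so three shuffles swap the 3-bit row and column fields.

```c
#include <assert.h>

enum { N = 8, NN = N * N };

/* Perfect shuffle: interleave the first and second halves of src. */
static void shuffle64(const int *src, int *dst) {
    for (int i = 0; i < NN / 2; i++) {
        dst[2 * i] = src[i];
        dst[2 * i + 1] = src[i + NN / 2];
    }
}

/* Transpose an 8x8 row-major matrix using log2(8) = 3 shuffles. */
static void matrix8by8_transpose(const int *src, int *dst) {
    int tm0[NN], tm1[NN];
    shuffle64(src, tm0);
    shuffle64(tm0, tm1);
    shuffle64(tm1, dst);
}
```

Filling src with the values 0..63 reproduces the matrix orders listed above after each stage, and the final result satisfies dst[c*8+r] == src[r*8+c] for every row r and column c.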



Assembler code for procedure: 



_Matrix8By8Transpose: 
L.128.1 
L.128.1 

G.SHUFFLE.8 
L.128.1 

G.SHUFFLE.8 
L.128.1 

G.SHUFFLE.8 
L.128.1 



r4,r2,0 

r12,r2,64 

r20,r4,r12 

r6,r2,16 

r22,r5,r13 

r14,r2,80 

r24,r6,r14 

r8,r2,32 



# 00 01 02 03 04 05 06 07 

# 32 33 34 35 36 37 38 39 

# 00 32 01 33 02 34 03 35 
#08 09 10 11 12 13 14 15 

# 04 36 05 37 06 38 07 39 

# 40 41 42 43 44 45 46 47 
#08 40 09 41 10 42 11 43 

# 16 17 18 19 20 21 22 23 
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r26 ) r7,r15 

r16,r2,96 

r28,r8,r16 

r10,r2 ) 48 

r30,r9,r17 

r18,r2,112 

r32,r10,r18 

r34,r11,r19 



# 12 44 13 45 14 46 15 47 

# 48 49 50 51 52 53 54 55 

# 16 48 17 49 18 50 19 51 

# 24 25 26 27 28 29 30 31 

# 20 52 21 53 22 54 23 55 

# 56 57 58 59 60 61 62 63 

# 24 56 25 57 26 58 27 59 

# 28 60 29 61 30 62 31 63 

# 00 16 32 48 01 17 33 49 A 

# 02 18 34 50 03 19 35 51*% 

# 04 20 36 52 05 21 37f3 

# 06 22 38 54 07 2 
#08 24 40 56 09 i 

# 10 26 42 58 1C 

# 12 28 44 6 
#t4 30 46J 



G.SHUFFLE.8 
L.128.1 

G.SHUFFLE.8 
L.128.1 

G.SHUFFLE.8 
L.128.1 

G.SHUFFLE.8 
G.SHUFFLE.8 

G.SHUFFLE.8 
G.SHUFFLE.; 
G. SHUFFLE.: 
G.SHUFFLE.8 
G.SHUFFLE.8 
G.SHUFFLE.8 
G.SHUFFLE.8 
G.SHUFFLE.8 

G.SHUFFLE.8 

S.128.1 
G.SHUFFLE.8 

S.128.1 
G.SHUFFLE.8 

S.128.1 
G.SHUFFLE.8 

S.128.1 
G.SHUFFLE.8,; 

S.128.1 
G.SHUFJ 

S.1 
G.SI 

S.121 
G.SHUFFJ 

S.1 



The following code is based upon the Independent JPEG Group's software
'jfwddct.c' 55, using 16-bit multiplies generating a 32-bit result.
#include "jinclude.h"

#define RIGHT_SHIFT(x,shft) ((x) >> (shft))

#define LG2_DCT_SCALE 15 /* lose a little precision to avoid overflow */

#define ONE ((INT32) 1)

#define DCT_SCALE (ONE << LG2_DCT_SCALE)

/* In some places we shift the inputs left by a couple more bits, */
/* so that they can be added to fractional results without too much */
/* loss of precision. */

#define LG2_OVERSCALE 2

#define OVERSCALE (ONE << LG2_OVERSCALE)
#define OVERSHIFT(x) ((x) <<= LG2_OVERSCALE)




55 Copyright (C) 1991, Thomas G. Lane. 
/* Scale a fractional constant by DCT_SCALE */
#define FIX(x) ((INT32) ((x) * DCT_SCALE + 0.5))

/* Scale a fractional constant by DCT_SCALE/OVERSCALE */
/* Such a constant can be multiplied with an overscaled input */
/* to produce something that's scaled by DCT_SCALE */
#define FIXO(x) ((INT32) ((x) * DCT_SCALE / OVERSCALE + 0.5))

/* Descale and correctly round a value that's scaled by DCT_SCALE 7 
#define UNFfX(x) RlGHT„SHIFT((x) + (ONE « (LG2_DCT_SCALE-1 », L<S 

/* Same with an additional division by 2, ie, correctly rounded UNFI& 
#define UNFIXH(x) RIGHT_SHIFT((x) + (ONE « LG2_DCT_S-~ " ' ~ 




T_SCALE) 



/* Take a value scaled by DCT_SCALE and round to integers 
#define UNFIXO(x) RIGHT_SHIFT((x) + (ONE <%(LG2_D©^ 
LG2_DCT_SCALE-LG2_O^Sff 

/* Here are the constants we need 7 
/* SINjJ is sine of i*pi/j, scaled by 
/* COS_iJ is cosine of i*pi/j, scaled 

#define SIN_1_4 FIX(0.70710) 
#define COS_1_4 SIN_1_4 ( 

#define SIN_1 8 FIX(< 
#define COS_i" 8 FC " 
#define SIN_3_8 
#define COS_3„ 

#define SIN_1_ 
#define COS_1_16 
#define SIN_7_1 
#define COS_7#kSI 



LU is sinefbf i*pi/j, scaled by DCT_SCALE/OVERSCALE 7 
)COS_iJ is cosine of i*pi/j, scaled by DCT_SCALE/OVERSCALE 7 

#define OSIN_1_4 FIXO(0.707106781)
#define OCOS_1_4 OSIN_1_4

#define OSIN_1_8 FIXO(0.382683432)
#define OCOS_1_8 FIXO(0.923879533)
#define OSIN_3_8 OCOS_1_8
#define OCOS_3_8 OSIN_1_8

#define OSIN_1_16 FIXO(0.195090322)
#define OCOS_1_16 FIXO(0.980785280)
#define OSIN_7_16 OCOS_1_16
#define OCOS_7_16 OSIN_1_16

#define OSIN_3_16 FIXO(0.555570233)



PCT_SCALE+1) 



by ^OVERSCALE 7 
" ^ 12_OVERSCALE)),\ 
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#define OCOS_3_16 FIXO(0.831469612) 
#define OSIN_5_16 OCOS_3_16 
#define OCOS_5_16 OSIN_3_16 



* Perform a 1 -dimensional DCT. 

* Note that this code is specialized to the case DCTSIZE = 8. 



INLINE 
LOCAL void 

fast_dct_8 (DCTELEM *in, int stride) 
{ 

/* many tmps have nonoverlapping lifetime 
* should be able to do this lot very well 

7 

INT16 inO, in1, in2, in3, in4, in5, in6, in7; 
INT16 tmpO, tmp1, tmp2, tmp3, tmp4, tm| 
INT16 tmp10, tmp1 1, tmp12, tmp13; * 
INT16 tmp14, tmp15, tmp16, tmp17^ , 
INT16 tmp25, tmp26; 

inO = in[ 0]; 
in1 = Infstride ]; 
in2 = in[stride*2]; Jr 
in3 = in[strlde*3]; 
in4 = infstrideM]; |[ ^ 
ir>5 = in[stride"5j; " 
in6 = in[stride*6jtW " J 
in? = infsrride-?}, ■ 

tmpO = in7 + inO; ^Jr% 
tmp1 = in6 + int#^^^ 
tmp2 = in5 + fefiL ^ \ 
tmp3 = in4 + in3; M> 
tmp4 = irt|%ni 
tmp5 = jn2 - Ktft 
tmu6 - if 1 *! - ir,6; 
\rup7 ^mO- in7, 

tmptC = tmp3 + tmpO, 
^lp11 = tmp2 + tmp1; 
trhp12 = tmp1 - tmp2; 
tmp13 = tmpO - tmp3; 



in[ 0] = (DCTELEM) UNFIXH((tmp1 0 + tmp1 1 ) * SI N_1 _4); 
in[stride*4] = (DCTELEM) UNFfXH((tmp10 - tmp11) * COS_1_4); 




irtfstride'^] = 
in[stride*6] = 



(DCTELEM) UNFIXH(tmp13*COS_1_8 + tmp12*SIN_1_8); 
(DCTELEM) UNFIXH(tmp13*SIN_1_8 - tmp12*COS_t_8); 



tmp16 = UNFIXO((tmp6 + tmp5) * SIN_1_4); 
tmp15 = UNFIXO((tmp6 - tmp5) * COS_1_4); 



OVERSHIFT(tmp4); 
OVERSHIFT(tmp7); 
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/* tmp4, tmp7, tmp15, tmp16 are overscaled by OVERSCALE ,*/ 

tmp14 = tmp4 + tmp15; 
tmp25 = tmp4 - tmp15; 
tmp26 = tmp7 - tmp16; 
tmp17 = tmp7 + tmp16; 



in[stride ] = (DCTELEM) UNFIXH(tmp17*OCOS_1_16 + 
in[stride*7] = (DCTELEM) UNFIXH(tmp17*OCOS_7_16 • 
in[strtde*5] = (DCTELEM) UNFIXH(tmp26*OCOS_5_16 • 
in[stride*3] = (DCTELEM) UNFIXH(tmp26*OCOS_3_1 6 ■ 



* Perform the forward DCT on one block of samples 

* A 2-D DCT can be done by 1-D DCT on ( 

* followed by 1-D DCT on each column. * 



GLOBAL void 

j_fwd_dct (DCTBLOCK data) 
int i; 



for (i m 0; i < DCTSIZI 
fasLdct_8(data+r" 



for (i = 0; i < 
fasLdct_8(( 



The a$sembl< 



tmp14*OSIN_1_16); 
tmp14*OSIN 7_16); 
tmp25*OSIN_5_16); 
tmp25*OSIN_3^ - 




stride=8, is as follows: 



> L.12 „ 
L. 128.1 
L.128. 
G.ADD.16 
G.ADD.16 
G.ADD.16 
G.ADD.16 
G.SUB.16 
G.SUB.16 
G.SUB.16^- 
G.SUB.16 
G.ADD.16 
G.ADD.16 
G.SUB.16 
G.SUB.16 
G.ADD.16 
G.MULADD.16 
G.MULADD.16 



C64 
r14,r2,80 
r16,r2,96 
r18,r2,112 
r20,r18,r4 
r22,r16,r6 
r24,r14,r8 
r26,r12,M0 
r28,r10,r12 
r30,r8,r14 
r32.r6,r16 
r34,r4,r18 
r36,r26,r20 
r38,r24,r22 
r40,r22,r24 
r42,r20,r26 
r48»r36,r38 
r44,r48,$SIN. 
r46,r49,$SiN 



.,. 0}; ' 
m[stride ]; 
in[stride*2]; 
ln[stride*3]; 
in [stride *4]; 

# in5 = in[stride*5 ; 

# in6 = in[stride*6J; 

# in7 = in[stride*7 ; 

# tmpO = in7 + inO; 

# tmp1 = in6 + in1; 

# imp2 = in5 + in2; 

# tmp3 = in4 + in3; 

# tmp4 = in3 - in4; 

# tmp5 - in2 - in5; 

# tmp6 = in1 - in6; 

# tmp7 = inO - in7; 

# tmp10 = tmp3 + tmpO; 

# tmp11 = tmp2 + tmp1; 

# tmp12 = imp1 - tmp2; 

# tmp13 = tmpO - tmp3; 

# = tmp10 + tmp11 
.1_4,$32768 
.1_4,$32768 
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EXTRACT. 1. 16 

128.1 

SUB.16 

MULADD.16 

MULADD.16 

EXTRACT. 1. 16 

128.1 

MULADD.16 
MULADD.16 
MULADD.16 
MULADD.16 
EXTRACT.1.16 
128.1 

MULADD.16 
MULADD.16 
MULADD.16 
MULADD.16 
EXTRACT. i.1 6 
128.1 
ADD. 16 
MULADD.16 
MULADD.16 
EXTRACT.1.16 
SUB.16 
MULADD.16 
MULADD.16^ 
EXTRACT jj< 
SHL.16 *" 
SHL.16 
ADD - 



r44.r44.r46, 16 

r44,r2.0 # in[ 0] = ... 
r48.r36.r38 # =tmp10-tmp11 
r44,r48,$COS_l_4,$32768 
r46,r49,$COS_1„4,$32768 
r44.r44.r46. 16 

r44,r2,64 # in[stride*4] = ... 
r44, r42,$COS_1 _8,$32768 
r46, r43,$COS_1_8,$32768 
r44,r40,$SIN_1_8,r44 
r46,r41,$SlN_1_8,r46 
r44,r44,r46,16 
r44,r2,32 # in[stride*2] 
r44,r42,$SIN_1_8,$32768 
r46.r43,$SIN_1_8,$32768 
r44,r40,$-COS_1_8,r44 
r46,r41.$-COS 1 ' 
r44.r44,r46.1. 
r44.r2.96 
r48,r32,r/ A 
r44,r4f ' 
r46,R ' 
r44 




DD.1 

16" 

!8Jl ... 

>D.16 
MULADD.16 
MULADD.16 
EXTRACT. 1.1 6 
128.1 

MULADD.16 

MULADD.16 

MULADD.16 

MULADD.16 

EXTRACT. 1.1 6 

128.1 

MULADD.16 

MULADD.16 

MULADD.16 

MULADD.16 

EXTRACT. 1. 16 

128.1 



Hfstride} = ... 
7_16,$32768 
rr55.$OCOS_7_1 6.$32768 
r48,$-OSIN_7_16,r44 
r46,r49.$-OSIN_7_16,r46 
r44.r44.r46. 16 

r44, r2, 1 1 2 # in[stride*7] = . . . 
r44.r52,$OCOS_5_1 6.$32768 
r46,r53,$OCOS_5_1 6,$32768 
r44,r50,$OSIN_5_16.r44 
r46,r51,$OSIN_5_16,r46 
r44,r44.r46,16 

r44.r2.80 # in[stride*5] = ... 

r44,r52,$OCOS_3_1 6,$32768 

r46,r53,$OCOS_3_16,$32768 

r44,r50,$-OSIN_3_16,r44 

r46,r51.$-OSIN_3_16,r46 

r 44,r44,r46,16 

r44,r2.48 # in[stride*3] » ... 
rO 
The above code uses 10 G.ADD, 10 G.SUB, 32 G.MULADD, 10 G.EXTRACT.I,
and 2 G.SHL instructions, which can be scheduled in 64 cycles. This code
performs 8 1-dimensional DCTs at once, so it can be described as performing at
64/64 = 1.0 cycles/pixel.
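
The fixed-point convention used by the DCT listing, scaling fractional constants by 2^15 and descaling products with rounding, can be sketched in isolation. The names below (TO_FIXED, scaled_multiply) are hypothetical stand-ins chosen for this illustration rather than the macros of the original source, and the right shift assumes the compiler shifts negative values arithmetically.

```c
#include <assert.h>
#include <stdint.h>

#define LG2_SCALE 15                        /* same value as LG2_DCT_SCALE */
#define SCALE ((int32_t)1 << LG2_SCALE)

/* Convert a fractional constant to fixed point, rounding to nearest. */
#define TO_FIXED(x) ((int32_t)((x) * SCALE + 0.5))

/* Multiply a 16-bit sample by a fixed-point constant into a 32-bit
   product, then descale with rounding. Assumes an arithmetic right
   shift for negative products. */
static int16_t scaled_multiply(int16_t sample, int32_t constant) {
    int32_t product = (int32_t)sample * constant;
    return (int16_t)((product + (SCALE >> 1)) >> LG2_SCALE);
}
```

For example, the constant 0.707106781 becomes 23170, and multiplying a sample of 100 by it yields 71, the rounded value of 70.71.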

2-Dimensional Discrete Cosine Transform

The code for a 2-dimensional DCT uses the 1-dimensional DCT above, an 8x8
transpose, a second 1-dimensional DCT, and a second 8x8 transpose. The
load and store operations which are performed between these steps can be
eliminated by procedure inlining, so we can estimate the performance by counting
the Group instructions alone, which total to 2*64+2*24 = 176 cycles. The
2-dimensional DCT covers 64 pixels, which works out to a rate of 2.8 cycles/pixel.
An inverse DCT should have similar performance.

Floating-point Discrete Cosh 




^ (16-bit) floating-point 
ilatej:erms is performed 
»D instructions and 
is c ; a^be%emoved. Also, 10 of 
OL. Tp^ ^-Dimensional DCTs 
~|.M:ADD, 3 GF.MULSUB 
fdfttfc 2 -dimensional 8x8 DCT 
DCT should have 



The DCT can also be perfoi 
operations. In 
using half-precision floatin 
100% of the G.S" " 
the G.MULADD 
would use 10 GF. 
instructions, 
uses 2*36+2' 
similar perfoi 



y ied linear sequence of items, the 

el^^^^. This reduces the fixed-point 
^ll^el/pixel; the floating-point DCT cost 
1 :|%ies/pixel. 

The following section demonstrates that the transpose cost can be reduced to 16
cycles, by using a combination of memory loads and stores and the G.SHUFFLE
operations, producing a fixed-point DCT in 2*64 + 16 = 144 cycles, or 2.3
cycles/pixel and a floating-point DCT in 2*36 + 16 = 88 cycles, or 1.4 cycles/pixel.
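
The per-pixel figures follow directly from the cycle counts. A minimal helper, with a hypothetical name, that reproduces the arithmetic for an 8x8 block of 64 pixels:

```c
#include <assert.h>

/* Cycles per pixel for an 8x8 2-D DCT built from two 1-D passes plus
   the total cycles spent on transposition. */
static double dct2d_cycles_per_pixel(int cycles_per_1d_pass, int transpose_cycles) {
    int total = 2 * cycles_per_1d_pass + transpose_cycles;
    return total / 64.0;
}
```

The fixed-point case gives 144/64 = 2.25 and the floating-point case gives 88/64 = 1.375, which the text rounds to 2.3 and 1.4 cycles/pixel.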

Other Matrix Applications 
Internal 4x4 Matrix Transpose

This example details the transposition of a 4-by-4 matrix of 16-bit values, stored 
consecutively in memory. The calculation is performed entirely in registers, using 
G.SHUFFLE instructions and a technique described in 56, in which the first and
second halves of the matrix are shuffled log2N times.

Assume the matrix originally is in the order: 



After one shuffling, the matrix is in the order: 



After a second shuffling, the matrix is^ 




C code for procedure: 

void Matrix4By4Tranipose(frt?16 *src, 
int16 tmO!16W- * - ^ 
int16 tm1p6j* 
inti; 



G.SHUFFLE. 8 

S.128.1 
G.SHUFFLE. 8 

S.128.1 

B 



r4,r8,r10 
r4,r3,0 
r6,r9,r11 
r6,r3,16 
rO 



01 02 03 04 05 06 07 
08 09 10 11 12 13 14 15 
00 08 01 09 02 10 03 11 
#04 12 05 13 06 14 07 15 

#00 04 08 12 01 05 09 13 

#02 06 10 14 03 07 11 15 



The resulting code transposes a 4-by-4 matrix i 



I 5 cycles. 



56 Stone, Harold, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on 
Computers, Vol C-20, No. 2, February 1971, 153 
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External Matrix Transpose 

A large matrix may not fit in the register file all at once, and even if it could, the 
internal matrix transpose algorithm performs O(NlogN), as each doubling of the 
matrix size requires an additional shuffle. 

To support the transpose of a large matrix, the internal matrix transpose algorithm
can be extended to transpose individual blocks, or sub-matrices, of a large matrix,
by modifying the code to specify the row size of the matrix. 57



If we consider each element in the left matrix below to be a 
above, the transpose of the matrix is the right matrix bej 
of the right matrix is the transpose of the corresponding* ' 
Note that elements 0, 9, 18, 27, 36, 45, 54,jnd 63 are^ 
each of the other elements are transpose4jtnd exej * 
the matrix. Thus another useful exte^on%l 
transposes two submatrices simultalieoiijslyKwr'. i 
locations.58 




| s 8 submatrix as 
tere each element 
i in the left matrix, 
in-place, and that 
Opther element in 
hspose algorithm 
ack in exchanged 



0 12 3 4 
9 10 11 12 <s 

16 17 18 19 

24 25 26 2 

32 33 34 : 

40 41 
48 
56 



24^42 40 48 56 

41 49 57 

3%42 50 58 

5 43 51 59' 

:8 36 44 52 60 

19 §3 45 53 61 • 

M> \S.46 54 62 

47 55 63 



|natrix, which can be easily 
ting" the L.128.I and S.128.I 
jstructions. The cost of the 4x4 
transpose, so an external matrix 
"piaster than using the 8x8 submatrix 
This section is under construction.

Mnemosyne System Application

MicroUnity's Terpsichore system architecture uses nine Mnemosyne memory
devices in its base configuration, providing nine byte-wide paths between the
processor and memory. The memory devices are used to build a 0.5 Mbyte cache 
between Terpsichore's first level caches and DRAM-based main memory. The 

57 This modification uses A-type instructions to increment the src pointer by the row size between
successive L instructions, taking no additional cycles.

58 For such a case, it is useful to use the indexed addressing form, so that the same index can be
applied to the two pointers for the L and S instructions.
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main memory store consists of 9, 18 or 36 banks of 1Mx72 arrays (each bank is 
eighteen 4 Mbit DRAMs), which yields 64, 128 or 256 Mbytes of ECC memory with
8, 16, or 32 Mbytes of directory storage.
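The capacities quoted above can be checked with a little arithmetic. The sketch below rests on two assumptions that are inferred rather than stated at this point in the text: each 72-bit word carries 64 data bits plus 8 check bits, and one ninth of the usable storage holds directory octlets (one 8-byte directory entry per 64-byte cache line). The function name and parameters are illustrative.

```python
MBYTE = 2 ** 20


def memory_capacity(banks, words_per_bank=2 ** 20, data_bytes_per_word=8):
    """Return (cache-line data bytes, directory bytes) for a bank count.

    Assumes 8 usable data bytes per 72-bit word (the ninth byte is check bits)
    and one directory octlet for every eight data octlets.
    """
    usable = banks * words_per_bank * data_bytes_per_word
    directory = usable // 9
    return usable - directory, directory


for banks in (9, 18, 36):
    data, directory = memory_capacity(banks)
    print(f"{banks} banks: {data // MBYTE} Mbytes ECC memory, "
          f"{directory // MBYTE} Mbytes directory")
```

Under the same assumptions, passing a larger `words_per_bank` models bigger DRAM parts.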





[Figure: System Application block diagram showing the Terpsichore processor, 9 x Mnemosyne devices forming the 0.5 Mbyte cache, and the DRAM array.]



To further expand the DRAM memory and improve the bandwidth to memory,
two or four Mnemosyne memory devices may be placed in each of the nine
byte-wide paths. Such configurations use 18, 36, 72, or 144 banks of 1Mx72 arrays,
which yields 128, 256, 512, or 1024 Mbytes of ECC memory with 16, 32, 64, or 128
Mbytes of directory storage.



Mnemosyne provides sufficient address bits to support up to 16Mx72 DRAM array
banks, using DRAM parts as large as 64 Mbit when available. In such a
configuration, memory sizes as large as 16 Gbytes of ECC memory can be
constructed.

Terpsichore uses a 64-byte cache line size. Each cache line is associated with an 
octlet (8 bytes) of directory information, using one of the nine "Hermes channels" 
provided by a Mnemosyne device with its associated DRAM. The remaining eight 
of the nine Hermes channels contain the eight octlets (eight byte units) of the 
cache line data. In order to provide the means to access individual octlets of cache 
data and directory information at maximum bandwidth, the directory information 
is scattered evenly among eight of the nine byte lanes. 
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Typical C erberus configurations 

The number of devices in a typical Cerberus bus may vary from a minimum of
about 11 devices (8 Mnemosyne, 1 Terpsichore, 1 Calliope, 1 FPGA), to a
moderate amount of about 40 (36 Mnemosyne, 1 Terpsichore, 1 Calliope, 1 Hydra,
1 FPGA), or about 48 (36 Mnemosyne, 4 Terpsichore, 4 Calliope, 4 Hydra, 1
FPGA), to a maximum of about 157 devices (144 Mnemosyne, 4 Terpsichore, 4
Calliope, 4 Hydra, 1 FPGA).
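The device counts above can be tallied with a short script. The configuration labels used as dictionary keys are hypothetical names chosen for this sketch; the per-device numbers are the ones listed in the paragraph.

```python
# Device mix per Cerberus bus configuration, as listed in the text.
CONFIGURATIONS = {
    "minimum": {"Mnemosyne": 8, "Terpsichore": 1, "Calliope": 1, "FPGA": 1},
    "moderate": {"Mnemosyne": 36, "Terpsichore": 1, "Calliope": 1,
                 "Hydra": 1, "FPGA": 1},
    "multiprocessor": {"Mnemosyne": 36, "Terpsichore": 4, "Calliope": 4,
                       "Hydra": 4, "FPGA": 1},
    "maximum": {"Mnemosyne": 144, "Terpsichore": 4, "Calliope": 4,
                "Hydra": 4, "FPGA": 1},
}

# Total number of devices on the bus for each configuration.
totals = {name: sum(devices.values()) for name, devices in CONFIGURATIONS.items()}

# Spread between the largest and smallest bus populations.
ratio = totals["maximum"] / totals["minimum"]

for name, count in totals.items():
    print(f"{name}: {count} devices")
print(f"maximum/minimum ratio: {ratio:.1f}")
```

The listed parts of the third configuration sum to 49, one more than the approximate figure quoted in the text, and the largest bus carries roughly fourteen times as many devices as the smallest.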



[Figure: Minimum Terpsichore system]

[Figure: Moderate Terpsichore system application: one maximally-configured processor per board]
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Cerberus performance 



When determining the performance of Cerberus, this 15:1 variation in the 
number of devices on the bus has a critical effect. The performance of these 
configurations with a resistive termination is estimated below: 
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Index 

Absolute Maximum Ratings 201, 237 
Access detail required 

by global TLB 65, 129, 133, 138, 

142 

by local TLB 65, 129, 133, 138, 142 
by tag 65, 129, 133, 138, 142 
Access disallowed

by global TLB 65, 129, 133, 138, 
141 

by local TLB 65, 129, 133, 138, 142 
by tag 65, 129, 133, 137, 141 
by virtual address 58, 59, 65, 129, 
133,137,141 
Address 51 
add 51 

immediate 54 
and 51 

immediate 54 
and not 51 
Copy Immediate
exclusive nor
Immediate

Reverse
nand

immediate

nor 



,^51 = 
Reversed 56 
shift left immediate 57
Short Immediate 57
signed shift right immediate 57
subtract 56

immediate 54, 55
unsigned shift right immediate 57
xor 51

immediate 54 
Always Reserved 50 
architecture description registers 
186,214,254,286 
Arithmetic Operations 24, 33 
bandwidth 220 



bank 231
banks 228

bias current of each input operational 
amplifier. Each such field contains 
configuration data in the following 
format 261 
block diagram 201
Branch 58, 59
and

equal to zero 60

JF!J ero ^ 

equal to zero 60 
ithan zero 60 
»r equal to zero 60 
in zero 60 




ld60 
tingle 60 

iot equal 
double 60 
half 60 
quad 60 
single 60 
not unordered greater or' 
equal 

double 60 
half 60 
quad 60 
single 60 
not unordered or equal 

double 60
half 60
quad 60
single 60
not unordered or less
double 60
half 60
quad 60
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single 60 
unordered greater or equal 
double 60 
half 60 
quad 60 
single 60 
unordered or equal 
double 60 
half 60 
quad 60 
single 60 
unordered or less 
double 61 
half 60 
quad 61 
single 60 
Gateway Immediate 64 
Immediate 66 
and link 66 
not equal 61 
signed 

greater or eqi 
less 61 
unsigned 
greater 
less 6. 
Branch Cond; 
byte ordering 275; 
Cache Coherencies 
Cache coh«^a& 
required 
by • 

14: 



g65, 
»pe 151, 351 

Lope's configuration registers 
comply with the Cerberus and 
Hermes specifications. Cerberus 
registers 241 
capacitance 204 
cascade 220 
cascaded 232 

Cerberus 187, 205, 215, 228, 241, 254, 
283, 287, 288, 289, 300, 301, 351 
Cerberus registers 175 
Cerberus status register 282 
check byte 275, 282 
checkpoint 173, 174 
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Clock 167, 270 
Clock Event 167 
clock rate 288 
collision 292, 293 
collisions 295 
command 276 
Compare-and-set 25 
configurable 221 
configuration 203, 204, 239
configuration regi^^C 265 
conflict 223 
consistency 2 





ach cable input 
:h such field 
iration data in the 
259 

each output 
such field 
ion data in the 
262 

tg Operations 27 

mk # oals270 

f capacity 200 
^216,217,218,219, 228, 

:al characteristics 202, 238, 

276 

•rror response 282 
Euterpe 198 
Exceptions 26 
Execute 67 
add 67 

and check signed overflow 67 
add immediate 71 

and check signed overflow 71 
and 67 

logarithm of most significant 
bit 67 

summation of bits 67 
and immediate 71 
and not 67 

Copy Immediate 70

exclusive nor 67 
gather 67 
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Immediate 71 

Reversed 73 
multiplex 79 

nand immediate 71
nor immediate 71 
not and 67 
not or 67 
or 67 

or immediate 71 
or not 67 
Reversed 75 
scatter 67 
set 

equal 75 
not equal 75 
signed 

greater or equal 75 

less 75 
unsigned 

greater or equal 

less 75 
set immediate 
equal 73 
not equal |3l 



unsigned 

greater or equal 75 

less 75
subtract immediate
and check 
equal 73 
not equal 73 
signed 

greater or equal 73
less 




^signed 

ihort Imm 
^ signed 

expand 67 

immediate 77 
shift right 67 
immediate 77 
subtract 75
and check 
equal 75 
not equal 75 
signed 

greater or equal 75 
less 75 
overflow 75 
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24, 80 
absolute value 
double 91 
half 91 
quad 91 
single 91 
add 

double 80 
half 80 
quad 80 
single 80 
convert 

double from integer 91, 92 
double from quad 91 
double from single 92 
half from integer 91 
half from single 91 
integer from double 92 
integer from half 92 
integer from quad 92 
integer from single 92 
1 fro m dou ble 92 
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quad from integer ' 

single from double 

single from half 92 

single from integer 
divide 

double 80, 81 

half 80 

quad 81 

single 80 
multiply 

double 81 

half 81 

quad 81 

single 81 
multiply and add 

double 89 

half 89 

quad 89 

single 89 
multiply and subtract 

double 89 

half 89 

quad 89 

single 89
negate

double 92

half 92

quad 92

single
Reversed
set



91 



quad 84 
single 84 
not greater or equal 
double 84 
half 84 
quad 84 
single 84 
not or less 
double 84
half 84




ible^O 



qual w 
single 84 
greater or equal 
double 84, 85 
half 84, 85 
quad 84, 85 
single 84, 85 
less 

double 84 
half 84 
quad 84 
single 84 
not equal 
double 84 
half 84 
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^Cfefe 85 

unordered greater or equal 
double 85
half 85
quad 85 
single 85 
unordered or less 
double 85 
half 85 
quad 85 
single 85 
square root 
double 93 
half 92, 93 
quad 93 
single 93 
subtract 
double 85 
half 85 
quad 85, 86 
single 85 
Ternary 89 
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Unary 91 
Floating-point arithmetic 83, 88, 90, 
95, 116, 120, 122, 126 
Forced Perfect Termination 288 
FPGA 287

Galois Field Operations 33 
Gateway 21 

Global TLB miss 65, 130, 134, 138, 
142 

Group 96 
add 

bytes 96 

doublets 96 

nibbles 96 

octlets 96 

pecks 96 

quadlets 96 
and 96 
and not 96 
compress 

bits 96 

bytes 96 

doublets 96 

immediat< 



quadlets 96 
exclusive-nor 98 
exclusive-or 98 
extract

hexlet 111 
Extract Immediate 103 
bits 103 
bytes 103 
doublets 103
hexlet 103
nibbles
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™ ; „ r± Jex and gather 111 
scatter and multiplex 111 
' 97 
ft 91 
*or97 
or not 97 

polynomial divide 
bits 97 
bytes 97 
doublets 97 
nibbles 97 
octlets 97 
pecks 97 
quadlets 97 
Reversed 105 
scatter 
bytes 97 
doublets 97 
hexlet 97 
nibbles 97
octlets 97

pecks 97
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quadlets 97 

t 

equal 

bytes 105 

doublets 105 

nibbles 105 

octlets 105 

pecks 105 

quadlets 105 
not equal 

bytes 105 

doublets 105 

nibbles 105 

octlets 105 

pecks 105 

quadlets 105 
signed 

greater or equal 
bytes 105 
doublets 105 
nibbles 105
octlets 105
pecks 105
quadlets 105

less

bytes 105
doublets 105
nibbles 105
octlets 105
pecks 105



immediate 
bytes 108 
doublets 108 
nibbles 108 
octlets 108
pecks 108 
quadlets 108 

nibbles 97

octlets 97

pecks 97

quadlets

Short Immediate 108
shuffle 





■€/\ 



greater or equal

doublets 105
nibbles 105
octlets 105
pecks 105 
quadlets 105 
less 

bytes 105 
doublets 105 
nibbles 105 
octlets 105
pecks 105 
quadlets 105 

shift left 
bytes 97 
doublets 97 
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?2% V £'kits96 
bytes 96 
doublets 96
immediate
bits 108
bytes 108
doublets 108
nibbles 108

octlet 108
pecks 108

quadlets 108

nibbles 96 

octlet 96

pecks 96 

quadlets 96 

multiply 

bits 97 

bytes 97 

doublets 97
nibbles 97
octlets 97 
pecks 97 
quadlets 97 
multiply and add 
bits and pecks 111 
bytes and doublets 111 
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doublets and quadlets HI 
nibbles and bytes 111 
octlets and hexlets 111
pecks and nibbles 111
quadlets and octlets 111 
shift right 
bytes 98 
doublets 98 
immediate 
bytes 108 
doublets 108 
nibbles 108 
octlets 108
pecks 108
quadlets 108
nibbles 98
octlets 98
pecks 98 
quadlets 98 
subtract 
bytes 105 
doublets 106 
nibbles 105 
octlets 106
pecks 105




^ Egned 
expand 

bits 98 

bytes 98 

doublets 98 

immediate 
bits 108 
bytes 108 
doublets 108 
nibbles 108 
octlet 108
pecks 108
quadlets 108

nibbles 98

octlet 98
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pecks 98 ' 
quadlets 98 
multiply 
bytes 98
doublets 98
nibbles 98
octlets 98
pecks 98 
quadlets S 
multiply t 



98 

lets 98 
iate 
:s 108 
>letsl08 
Ibbles 108 
octlets 108
108 
its 108 

^tji32s%8 8 

pecks 98

quadlets 98
floating-point
absolute value
double 123
half 123

single 123
add 

double 114 
half 114 
single 114 
convert 

double from integer 123 
double from single 123, 124 
half from integer 123 
half from single 123 
integer from double 124 
integer from half 124 
integer from single 124 
single from double 123 
single from half 123
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single from integer 123 
divide 

double 114 

half 114 

single 114 
multiply 

double 115 

half 114, 115 

single 115 
multiply and add 

double 121 

half 121 

single 121 
multiply and subtract 

double 121 

half 121 

single 121 
negate 

double 124 

half 124 

single 124 

set 

equal 

double J. 
half h 



double 117 
half 117 
single 117 
not greater or equal 
double 117 
single 117 
half 117 
not less 

double 117 
single 117 
not less half 117 
not unordered greater or 
equal 

double 117 



half 117 
single 117 
not unordered or equal 
double 117 
half 117 
single 117 
not unordered or less 
double 117 
half 117 
single ] 

or equal 




118 
half 118
single 118
" mdwidth 204, 240 
ra 151, 351 
274, 276, 277 
implementation-defined parameters 
199,235,269 

implementation-dependent 219, 223, 
232,286 
interleave 220 
interleaving 198, 232 
internal buffer overflow 282 
invalid address 282 
invalid command 282 
invalid identification number 282 
latency 232, 285 
least-privileged level 143 
line 230 

Load 127

hexlet 

big-endian 127 
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aligned 127 

immediate 131 
immediate 131 
little-endian 127 
aligned 127 

immediate 131 
immediate 131 
Immediate 131 
octlet 

big-endian 127 
aligned 127 

immediate 131 
immediate 131 
little-endian 127 
aligned 127 

immediate 131 
immediate 131 

signed 
byte 127 

immediate 131 
doublet 

big-endian 
aligned 




immediate 132 

octlet 

big-endian 128 
aligned 128

immediate 132
immediate 132
little-endian 128
aligned 128

immediate 132

quadlet
big-endian 128
aligned 128

immediate 132
immediate 132
little-endian 128
aligned 128
immediate 132
immediate 132

S 130, 134, 138, 142 
-.3,239 
-address 220 



immediate 13 
little-endian 127
aligned 127 

immediate 131 
immediate 131 

unsigned 
byte 127 

immediate 131 
doublet 

big-endian 127 
aligned 127 

immediate 131 
immediate 131 
little-endian 127 
aligned 128 

immediate 131 
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r »ent 143 
ichore 351 
>1, 351 

, configuration registers 

p> with the Cerberus and 
les specifications. Configuration 
, : %& .jters 205 
55 'Moderate Terpsichore 351 
most-privileged level 143 
multiprocessor 143 
noise 223 
octlet 204, 239 

operating modes of the cable input 
blocks. Each such field contains 
configuration data in the following 
format 259 

operating modes of the cable input 
equalizers. Each such field contains 
configuration data in the following 
format 263 
packaging 200, 235 

Page mode 223

parity 275

partition 228
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physical memory blocks 228 
pins 200, 235 
Pipeline Organization 38 
PLL 218 

PMOS 193,218, 265 

power 189, 200, 224, 235, 256 

precharge 223 

process characteristics 193, 218, 265 
queue 223 
queued 220 
rank 228, 231 
read octlet 300 
read-allocate 276, 277 
read-noallocate 276, 277 
read-response 276, 278 
Recommended operating conditions 
201,237 

redundancy 198 
redundant 227, 228, 229, 231 
reserved 298 

Reserved Instruction 63, 6! 
90, 95, 102, 104, 107, l: 
120, 122, 126, 129, 
reset 171, 197, 22< 
Rounding 26 
SCI 150 
set on compare* 
side-effects 300 
single-set 203 
skew 189, 197 
256,268, 
slew 21j 

software 2|#T 241 

71, n; 

Agister 189, 205, 217, 218, 
240,241,255,256,283 
aiped218, 279 
Store 135 

add-and-swap octlet 
big-endian aligned 135 
big-endian aligned immediate 
139 

little-endian aligned 135 

little-endian aligned immediate 

139 
byte 135 

immediate 139 
compare-and-swap octlet 

big-endian aligned 135 



big-endian aligned immediate 
139 

little-endian aligned 135 
little-endian aligned immediate 
139 
double 

big-endian 135 
aligned 135 

" ite 139 





aligned 135 
aligned immediate 

aligned 135
little-endian aligned immediate
139

multiplex-and-swap octlet
big-endian aligned 135 
big-endian aligned immediate 
139 

little-endian aligned 135 
little-endian aligned immediate 
139 
octlet 

big-endian 135 
aligned 135 

immediate 139 
immediate 139 
little-endian 135 
aligned 135 

immediate 139 
immediate 139 
quadlet 

big-endian 135 
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aligned 135 

immediate 139 
immediate 139 
little-endian 135
aligned 135 

immediate 139 
immediate 139 
Switching characteristics 203, 239,
274

syndrome 218, 219
Terpsichore 198, 234, 287, 349, 350,
351, 352, 353
testing 232, 286 
time-out 296, 300, 301 
timing 216, 221, 222, 233 
uncorrectable 217, 219 
write octlet 299 
write-allocate 276, 280 
write-back 203 
write-noallocate 276, 280 
write-response 276, 281 
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